Similarity encoding for learning with dirty categorical variables
Abstract
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
Keywords
Dirty data, Categorical variables, Statistical learning, String similarity measures
1 Introduction
Many statistical learning algorithms require as input a numerical feature matrix. When categorical variables are present in the data, feature engineering is needed to encode the different categories into a suitable feature vector.^{1} One-hot encoding is a simple and widely used encoding method (Alkharusi 2012; Berry et al. 1998; Cohen et al. 2013; Davis 2010; Pedhazur and Kerlinger 1973; Myers et al. 2010; O’Grady and Medoff 1988). For example, a categorical variable having as categories {female, male, other} can be encoded respectively with 3-dimensional feature vectors: {[1, 0, 0], [0, 1, 0], [0, 0, 1]}. In the resulting vector space, each category is orthogonal and equidistant to the others, which agrees with classical intuitions about nominal categorical variables.
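As a minimal illustration of the example above, the following sketch builds the one-hot vectors for {female, male, other} and checks that distinct categories are orthogonal. The helper names `one_hot_encode` and `dot` are hypothetical and chosen for this illustration only.

```python
def one_hot_encode(value, categories):
    """Return the one-hot vector of `value` with respect to `categories`."""
    return [1 if value == category else 0 for category in categories]


def dot(u, v):
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


categories = ["female", "male", "other"]
encoded = {c: one_hot_encode(c, categories) for c in categories}

# Distinct categories have a zero dot product: they are orthogonal.
orthogonal = dot(encoded["female"], encoded["other"]) == 0
```

Every pair of distinct vectors differs in exactly two coordinates, which is why all categories are also equidistant in this space.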
Non-curated categorical data often lead to larger cardinality of the categorical variable and give rise to several problems when using one-hot encoding. A first challenge is that the dataset may contain different morphological representations of the same category. For instance, for a categorical variable named company, it is not clear if ‘Pfizer International LLC’, ‘Pfizer Limited’, and ‘Pfizer Korea’ are different names for the same entity, but they are probably related. Here we build upon the intuition that these entities should be closer in the feature space than unrelated categories, e.g., ‘Sanofi Inc.’. In dirty data, errors such as typos can cause morphological variations of the categories.^{2} Without data cleaning, different string representations of the same category will lead to completely different encoded vectors. Another related challenge is that of encoding categories that do not appear in the training set. Finally, with high-cardinality categorical variables, one-hot encoding can become impracticable due to the high-dimensional feature matrix it creates.
Beyond one-hot encoding, the statistical-learning literature has considered other categorical encoding methods (Duch et al. 2000; Grabczewski and Jankowski 2003; Micci-Barreca 2001; Shyu et al. 2005; Weinberger et al. 2009), but, in general, they do not consider the problem of encoding in the presence of errors, nor how to encode categories absent from the training set.
From a data-integration standpoint, dirty categories may be seen as a data cleaning problem, addressed, for instance, with entity resolution. Indeed, database-cleaning research has developed many approaches to curate categories (Pyle 1999; Rahm and Do 2000). Tasks such as deduplication or record linkage strive to recognize different variants of the same entity. A classic approach to learning with dirty categories would be to apply these techniques as a preprocessing step and then proceed with standard categorical encoding. Yet, for the specific case of supervised learning, such an approach is suboptimal for two reasons. First, the uncertainty on the entity merging is not exposed to the statistical model. Second, the statistical objective function used during learning is not used to guide the entity resolution. Merging entities is a difficult problem. We build from the assumption that it may not be necessary to solve it, and that simply exposing similarities is enough.
In this paper, we study prediction with high-cardinality categorical variables. We seek a simple feature-engineering approach to replace the widely used one-hot encoding method. The problem of dirty categories has not received much attention in the statistical-learning literature, though it is related to database cleaning research (Krishnan et al. 2016, 2017). To ground it in supervised-learning settings, we introduce benchmarks on seven real-world datasets that contain at least one textual categorical variable with a high cardinality. The goal of this paper is to stress the importance of adapting encoding schemes to dirty categories by showing that a simple scheme based on string similarities brings important practical gains. In Sect. 2 we describe the problem of dirty categorical data and its impact on encoding approaches. In Sect. 3, we describe in detail common encoding approaches for categorical variables, as well as related techniques in database cleaning (record linkage, deduplication) and in natural language processing (NLP). Then, we propose in Sect. 4 a softer version of one-hot encoding, based on string similarity measures. We call this generalization similarity encoding, as it encodes the morphological resemblance between categories. We also present dimensionality reduction approaches that decrease the run time of the statistical learning task. Finally, we show in Sect. 5 the results of a thorough empirical study to evaluate encoding methods on dirty categories. On average, similarity encoding with 3-gram distance is the method that has the best results in terms of prediction score, outperforming one-hot encoding even when applying strong dimensionality reduction.
2 Problem setting: non-standardized categorical variables
In a classical statistical data analysis problem, a categorical variable is typically defined as a variable with values (categories) of either a nominal or ordinal nature. For example, place of birth is a nominal categorical variable. Conversely, answers in the Likert scale to the question: ‘Do you agree with this statement: A child’s education is the responsibility of parents, not the school system.’, compose an ordinal categorical variable in which the level of agreement is associated with a numerical value. In addition, given a prediction problem, variables can be either the target variable (also known as the dependent or response variable) or an explanatory variable (a feature or independent variable). In this work, we focus on the general problem of nominal categorical variables that are part of the feature set.
In controlled data-collection settings, categorical variables are standardized: the set of categories is finite and known a priori, independently from the data, and categories are mutually exclusive. Typical machine-learning benchmark datasets, as in the UCI Machine Learning Repository, use standardized categories. For instance, in the Adult dataset^{3} the occupation of individuals is described with 14 predefined categories in both the training and testing set.
Table 1 Entities containing the word Pfizer in the variable company name of the open payments dataset (year 2013)
Company name  Frequency
Pfizer Inc.  79,073
Pfizer Pharmaceuticals LLC  486
Pfizer International LLC  425
Pfizer Limited  13
Pfizer Corporation Hong Kong Limited  4
Pfizer Pharmaceuticals Korea Limited  3
This type of data poses a problem from the point of view of the statistical analysis because we do not know a priori, without external expert information, which of these categories refer to the exact same company or whether all of them have slight differences and hence should be considered as different entities. Also, we can observe that the frequency of the different categories varies by several orders of magnitude, which could imply that errors in the data collection process have been made, unintentionally or not.

Typographical errors (e.g., proffesor instead of professor)

Extraneous data (e.g., name and title, instead of just the name)

Abbreviations (e.g., Dr. for doctor)

Aliases (e.g., Ringo Starr instead of Richard Starkey)

Encoding formats (e.g., ASCII, EBCDIC, etc.)

Uses of special characters (space, colon, dash, parenthesis, etc.)

Concatenated hierarchical data (e.g., state–county–city vs. state–city)
3 Related work and common practice
Most of the literature on encoding categorical variables relies on the idea that the set of categories is finite, known a priori, and composed of mutually exclusive elements (Cohen et al. 2013). Some studies have considered encoding high-cardinality categorical variables (Micci-Barreca 2001; Guo and Berkhahn 2016), but not the problem of dirty data. Nevertheless, efforts on this issue have been made in other areas such as Natural Language Processing and Record Linkage, although they have not been applied to encode categorical variables. Below we summarize the main relevant approaches.
Notation: we write sets of elements with capital curly fonts, as \(\mathcal {X}\). Elements of a vector space are written in bold \(\mathbf {x}\), and matrices in capital and bold \(\mathbf {X}\). For a matrix \(\mathbf {X}\), we denote by \(x^i_j\) the entry on the ith row and jth column.
3.1 Formalism: concepts in relational databases and statistical learning
We first link our formulations to a database formalism, which relies on sets. A table is specified by its relational scheme \(\mathcal {R}\): the set of m attribute names \(\{A_j, j =1\ldots m\}\), i.e., the column names (Maier 1983). Each attribute name has a domain \(\text {dom}(A_j) = \mathcal {D}_j\). A table is defined as a relation r on the scheme \(\mathcal {R}\): a set of mappings (tuples) \(\{t^i: \mathcal {R} \rightarrow \bigcup _{j=1}^{m} \mathcal {D}_j, \; i=1\ldots n\}\), where for each record (sample) \(t^i \in r\), \(t^i(A_j) \in \mathcal {D}_j, \; j = 1\ldots m\). If \(A_j\) is a numerical attribute, then \(\text {dom}(A_j) = \mathcal {D}_j \subseteq \mathbb {R}\). If \(A_j\) is a categorical attribute represented by strings, then \(\mathcal {D}_j \subseteq \mathbb {S}\), where \(\mathbb {S}\) is the set of finitelength strings.^{5} As a shorthand, we call \(k_j = \text {card}(\mathcal {D}_j)\) the cardinality of the variable.
We now review classical encoding methods. For simplicity of exposition, in the rest of the section we will consider only a single categorical variable A, omitting the column index j from the previous definitions.
The one-hot encoding method is intended to be used when categories are mutually exclusive (Cohen et al. 2013), which is not necessarily true of dirty data (e.g., misspelled variables should be interpreted as overlapping categories).
Another drawback of this method is that it provides no heuristics to assign a code vector to new categories that appear in the testing set but have not been encoded on the training set. Given the previous definition, the zero vector will be assigned to any new category in the testing set, which creates collisions if more than one new category is introduced.
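The collision described above can be made concrete with a small sketch. It assumes that one-hot encoding is fitted on the training categories only; the category ‘Novartis AG’ and the helper name `one_hot_encode` are hypothetical and used purely for illustration.

```python
def one_hot_encode(value, known_categories):
    """One-hot vector of `value` against the categories seen at training time."""
    return [1 if value == category else 0 for category in known_categories]


# Categories observed in the (toy) training set.
train_categories = ["Pfizer Inc.", "Sanofi Inc."]

# Two different categories that only appear at test time.
unseen_a = one_hot_encode("Pfizer Limited", train_categories)
unseen_b = one_hot_encode("Novartis AG", train_categories)

# Both are mapped to the zero vector, so the model cannot tell them apart.
collision = unseen_a == unseen_b == [0, 0]
```

Note that ‘Pfizer Limited’ ends up exactly as far from ‘Pfizer Inc.’ as from ‘Sanofi Inc.’, which is the loss of information that a similarity-based encoding tries to avoid.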
Finally, high-cardinality categorical variables greatly increase the dimensionality of the feature matrix, which increases its computational cost. Dimensionality reduction on the one-hot encoding vector tackles this problem (see Sect. 4.2), with the risk of losing information.
Hash encoding. A solution to reduce the dimensionality of the data is to use the hashing trick (Weinberger et al. 2009). Instead of assigning a different unit vector to each category, as one-hot encoding does, one could define a hash function to designate a feature vector on a reduced vector space. This method does not consider the problem of dirty data either, because it assigns hash values that are independent of the morphological similarity between categories.
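A minimal sketch of hash encoding follows. It uses MD5 and 256 components, but the way the digest is mapped to a vector index (an integer modulo the number of components) is an assumption made for this illustration, and `hash_encode` is a hypothetical helper name.

```python
import hashlib


def hash_encode(category, n_components=256):
    """Map a category string to a unit vector whose index is given by a hash."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    # Assumption: the bucket is the digest read as an integer, modulo n_components.
    index = int.from_bytes(digest, "big") % n_components
    vector = [0] * n_components
    vector[index] = 1
    return vector


vector = hash_encode("Pfizer Inc.")
```

The bucket depends on the whole string through the hash, so two spellings of the same entity land in unrelated buckets, and two unrelated categories may share one.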
Another related approach is the MDV continuousification scheme (Grabczewski and Jankowski 2003), which encodes a category \(d^i\) by its expected value on each target \(c_k\), \(\mathbb {E}_\ell \bigl [d^\ell = d^i \mid \mathbf {y}^\ell = c_k\bigr ]\), instead of \(\mathbb {E}_\ell \bigl [\mathbf {y}^\ell \mid d^\ell = d^i \bigr ]\) used in the VDM. In the case of a classification problem, \(c_k\) belongs to the set of possible classes for the target variable. However, in a dirty dataset, as with spelling mistakes, some categories can appear only once, undermining the meaning of their marginal link to \(\mathbf {y}\).
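To illustrate why rare categories are a problem for this kind of target-based encoding, here is a rough sketch that estimates, for each category, its empirical frequency within each target class. This is one reading of the MDV scheme, not a reference implementation; the toy data and the name `mdv_encode` are hypothetical.

```python
from collections import Counter


def mdv_encode(categories, targets):
    """Encode each category by its empirical frequency within each target class."""
    classes = sorted(set(targets))
    class_counts = Counter(targets)
    joint_counts = Counter(zip(categories, targets))
    return {
        d: [joint_counts[(d, c)] / class_counts[c] for c in classes]
        for d in set(categories)
    }


# Toy data: category "c" appears only once, like a misspelled entry would.
categories = ["a", "a", "b", "b", "b", "c"]
targets = [0, 1, 0, 0, 1, 1]
encoding = mdv_encode(categories, targets)
```

The code vector of ‘c’ rests on a single observation, so it mostly reflects noise rather than a stable link between the category and the target.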
Embedding with neural networks. Guo and Berkhahn (2016) propose an encoding method based on neural networks. It is inspired by NLP methods that perform word embedding based on textual context (Mikolov et al. 2013) (see Sect. 3.2). In tabular data, the equivalent to this context is given by the values of the other columns, categorical or not. The approach is simply a standard neural network, trained to link the table \(\mathcal {R}\) to the target \(\mathbf {y}\) with standard supervised-learning architectures and losses, taking as input the table with its categorical columns one-hot encoded. Yet, Guo and Berkhahn (2016) use as a first hidden layer a bottleneck for each categorical variable. The corresponding intermediate representation, learned by the network, gives a vector embedding of the categories in a reduced dimensionality. This approach is interesting as it guides the encoding in a supervised way. Yet, it entails the computational and architecture-selection costs of deep learning. Additionally, it is still based on an initial one-hot encoding, which is susceptible to dirty categories.
Bag of n-grams. One way to represent morphological variation of strings is to build a vector containing the count of all possible n-grams of consecutive characters (or words). This method is straightforward and naturally creates vectorial representations where similar strings are close to each other. In this work we consider n-grams of characters to capture the morphology of short strings.
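The following sketch counts character 3-grams for two company names and lists the 3-grams they share. It assumes raw strings without padding or lowercasing, which is a choice made for this illustration; `char_ngrams` and `bag_of_ngrams` are hypothetical helper names.

```python
from collections import Counter


def char_ngrams(string, n=3):
    """List all n-grams of consecutive characters in `string`."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def bag_of_ngrams(string, n=3):
    """Count the occurrences of each character n-gram."""
    return Counter(char_ngrams(string, n))


bag_a = bag_of_ngrams("Pfizer Inc.")
bag_b = bag_of_ngrams("Pfizer Limited")

# Counter intersection keeps the n-grams present in both strings.
shared = sorted((bag_a & bag_b).keys())
```

Here the two names share the five 3-grams that cover ‘Pfizer ’, so their count vectors overlap even though the full strings differ.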
3.2 Related approaches in natural language processing
Stemming or lemmatizing. Stemming and lemmatizing are text preprocessing techniques that strive to extract a common root from different variants of a word (Lovins 1968; Hull 1996). For instance, ‘standardization’, ‘standards’, and ‘standard’ could all be reduced to ‘standard’. These techniques are based on a set of rules, crafted to the specificities of a language. Their drawbacks are that they may not be suited to a specific domain, such as medical practice, and are costly to develop. Some recent developments in NLP avoid stemming by working directly at the character level (Bojanowski et al. 2016).
Word embeddings. Capturing the idea that some categories are closer than others, such as ‘cervical spinal fusion’ being closer to ‘spinal fusion except cervical’ than to ‘renal failure’ in the medical charges dataset, can be seen as a problem of learning semantics. Statistical approaches to semantics stem from low-rank data reductions of word occurrences: the original LSA (latent semantic analysis) (Landauer et al. 1998) is a PCA of the word occurrence matrix in documents; word2vec (Mikolov et al. 2013) can be seen as a matrix factorization on a matrix of word occurrence in local windows; and fastText (Bojanowski et al. 2016), a state-of-the-art approach for supervised learning on text, is based on a low-rank representation of text.
However, these semantics-capturing embeddings for words cannot readily be used for categorical columns of a table. Indeed, tabular data seldom contain enough samples and enough context to train modern semantic approaches. Pretrained embeddings would not work for entries drawn from a given specialized domain, such as company names or medical vocabulary. Business or application-specific tables require domain-specific semantics.
3.3 Related approaches in database cleaning
Similarity queries. To cater for different ways information might appear, databases use queries with inexact matching. Queries using textual similarity help integration of heterogeneous databases without common domains (Cohen 1998).
Deduplication, record linkage, or fuzzy matching. In databases, deduplication or record linkage strives to find different variants that denote the same entity and match them (Elmagarmid et al. 2007). Classic record linkage theory deals with merging multiple tables that have entities in common. It seeks a combination of similarities across columns and a threshold to match rows (Fellegi and Sunter 1969). If known matching pairs of entities are available, this problem can be cast as a supervised or semisupervised learning problem (Elmagarmid et al. 2007). If there are no known matching pairs, the simplest solution boils down to a clustering approach, often on a similarity graph, or a related expectation maximization approach (Winkler 2002). Supervising the deduplication task is challenging and often calls for human intervention. Sarawagi and Bhamidipaty (2002) use active learning to minimize human effort. Much of the recent progress in database research strives for faster algorithms to tackle huge databases (Christen 2012).
4 Similarity encoding: robust feature engineering
4.1 Working principle of similarity encoding
One-hot encoding can be interpreted as a feature vector in which each dimension corresponds to the zero-one similarity between the category we want to encode and all the known categories (see Eq. 3). Instead of using this particular similarity, one can extend the encoding to use one of the many string similarities, e.g., as used for entity resolution. A survey of the most commonly used text similarity measures can be found in Cohen et al. (2003) and Gomaa and Fahmy (2013). Most of these similarities are based on a morphological comparison between two strings. Identical strings will have a similarity equal to 1 and very different strings will have a similarity closer to 0. We first describe three of the most commonly used similarity measures:
There exist more efficient versions of the 3-gram similarity (Kondrak 2005), but we do not explore them in this work.
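The working principle can be sketched as follows. The n-gram similarity is assumed here to be the Jaccard ratio between the sets of character 3-grams of the two strings (shared n-grams over all distinct n-grams), computed on raw strings without padding; other variants exist, and the function names are hypothetical.

```python
def ngram_set(string, n=3):
    """Set of character n-grams of `string`."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(s1, s2, n=3):
    """Assumed definition: shared n-grams divided by all distinct n-grams."""
    g1, g2 = ngram_set(s1, n), ngram_set(s2, n)
    union = g1 | g2
    if not union:
        # Both strings are shorter than n: fall back to exact matching.
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(union)


def similarity_encode(value, known_categories, n=3):
    """One coordinate per known category, holding the similarity to it."""
    return [ngram_similarity(value, category, n) for category in known_categories]


known = ["Pfizer Inc.", "Pfizer Limited", "Sanofi Inc."]
vector = similarity_encode("Pfizer Korea", known)
```

A known category is encoded with a 1 on its own coordinate, as in one-hot encoding, while the unseen ‘Pfizer Korea’ gets non-zero coordinates on the two Pfizer entries and zero on ‘Sanofi Inc.’ instead of collapsing to the zero vector.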
4.2 Dimensionality reduction: approaches and experiments
With one-hot or similarity encoding, high-cardinality categorical variables lead to high-dimensional feature vectors. This may lead to computational and statistical challenges. Dimensionality reduction may be used on the resulting feature matrix. A natural approach is to use Principal Component Analysis, as it captures the maximum-variance subspace. Yet, it entails a high computational cost^{9} and is cumbersome to run in an online setting. Hence, we explored using random projections: based on the Johnson-Lindenstrauss lemma, these give a reduced representation that accurately approximates distances of the vector space (Rahimi and Recht 2008).
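A random projection can be sketched in a few lines. The version below assumes a dense Gaussian projection matrix scaled by \(1/\sqrt{d}\), which is one common construction and not necessarily the one used in the experiments; the input vector is a made-up encoded sample and the function names are hypothetical.

```python
import math
import random


def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian matrix with entries N(0, 1/n_components), drawn independently of the data."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(vector, matrix):
    """Multiply a row vector of length n_features by the projection matrix."""
    n_components = len(matrix[0])
    return [sum(x * row[j] for x, row in zip(vector, matrix))
            for j in range(n_components)]


# A made-up similarity-encoded sample with 6 known categories, reduced to d = 3.
encoded = [1.0, 0.4, 0.0, 0.1, 0.0, 0.3]
matrix = random_projection_matrix(n_features=len(encoded), n_components=3)
reduced = project(encoded, matrix)
```

Because the matrix does not depend on the data, no fitting step is needed, which is what makes this approach cheap compared with PCA.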
A drawback of such a projection approach is that it requires first computing the similarity to all categories. Also, it mixes the contribution of all categories in nontrivial ways and hence may make interpreting the encodings difficult. For this reason, we also explored prototype-based methods: choosing a small number d of categories and encoding by computing the similarity to these prototypes. These prototypes should be representative of the full category set in order to have a meaningful reduced space.
One simple approach is to choose the \(d \ll k\) most frequent categories of the dataset. Another way of choosing prototype elements in the category set is to use clustering methods like k-means, which chooses cluster centers that minimize a distortion measure. We use as prototype candidates the closest element to the center of each cluster. Note that we can apply the clustering on an initial version of the similarity-encoding matrix computed on a subset of the data.
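The most-frequent-prototype strategy can be sketched as below. The counts in the toy column are invented for the illustration, the 3-gram similarity is assumed to be a Jaccard ratio on unpadded character 3-grams, and the function names are hypothetical. The k-means variant is not shown.

```python
from collections import Counter


def ngram_similarity(s1, s2, n=3):
    """Assumed definition: Jaccard ratio between the character n-gram sets."""
    g1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    g2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    union = g1 | g2
    if not union:
        return 1.0 if s1 == s2 else 0.0
    return len(g1 & g2) / len(union)


def most_frequent_prototypes(values, d):
    """Select the d most frequent categories of the column as prototypes."""
    return [category for category, _ in Counter(values).most_common(d)]


# Toy column with invented frequencies.
values = (["Pfizer Inc."] * 5 + ["Sanofi Inc."] * 3
          + ["Pfizer Limited"] + ["Pfizer Korea"])
prototypes = most_frequent_prototypes(values, d=2)

# Each sample is encoded with d coordinates instead of one per category.
encoded = [ngram_similarity("Pfizer Korea", p) for p in prototypes]
```

Only the similarities to the d prototypes are computed, so the cost per sample no longer grows with the full cardinality k.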
Clustering of dirty categories based on a string similarity is strongly related to deduplication or record-linkage strategies used in database cleaning. One notable difference with using a cleaning strategy before statistical learning is that we are not converting the various forms of the categories to the corresponding cluster centers, but rather encoding their similarities to these.
5 Empirical study of similarity encoding
Table 2 Dataset description
Dataset  Number of rows  Number of categories  Most frequent category (count)  Least frequent category (count)  Prediction type
Medical charges  1.6E+05  100  3023  613  Regression
Employee salaries  9.2E+03  385  883  1  Regression
Open payments  1.0E+05  973  4016  1  Binary classification
Midwest survey  2.8E+03  1009  487  1  Multiclass classification
Traffic violations  1.0E+05  3043  7817  1  Multiclass classification
Road safety  1.0E+04  4617  589  1  Binary classification
Beer reviews  1.0E+04  4634  25  1  Multiclass classification
Table 2 summarizes the characteristics of the datasets and the respective categorical variable (for more information about the data, see Sect. 8.1). The sample size of the datasets varies from 3000 to 160,000 and the cardinality of the selected categorical variable ranges from 100 to more than 4600 categories. Most datasets have at least one category that appears only once, hence when the data is split into a train and test set, some categories will likely be present only in the testing set. To measure prediction performance, we use the following metrics: \(R^2\) score for regression, average precision score for binary classification, and accuracy for multiclass classification. All these scores are upper bounded by 1 and higher values mean better predictions.
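For reference, two of the three metrics can be written in a few lines. The sketch below gives the usual textbook definitions of the \(R^2\) score and of accuracy; it is not the evaluation code of the experiments, and average precision is left out for brevity.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    # Undefined when y_true is constant (ss_tot == 0); not handled in this sketch.
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
acc = accuracy([1, 0, 1], [1, 1, 1])
```

A perfect prediction reaches the upper bound of 1 for both scores, while always predicting the mean of the target gives an \(R^2\) of 0.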
First, we benchmarked similarity encoding against one-hot encoding and other commonly used methods. Each boxplot in Fig. 2 contains the prediction scores of 100 random splits of the data (80% of the samples for training and 20% for testing) using gradient boosted trees and ridge regression. The right side of each plot shows the average ranking of each method across datasets in terms of the median value of the respective boxplots.
Figure 4 shows the prediction results for different dimensionality reduction methods applied to six of our seven datasets (medical charges was excluded from the figure because of its smaller cardinality in comparison with the other datasets). For dimensionality reduction, we investigated (i) random projections, (ii) encoding with similarities to the most frequent categories, (iii) encoding with similarities to categories closest to the centers of a k-means clustering, and (iv) one-hot encoding after merging categories with a k-means clustering, which is a simple form of deduplication. The latter method enables bridging the gap with the deduplication literature: we can compare merging entities before statistical learning to expressing their similarity using the same similarity measure.
6 Discussion
Encoding categorical textual variables in dirty tables has not been studied much in the statistical-learning literature. Yet it is a common hurdle in many application settings. This paper shows that there is room for improvement upon the standard practice of one-hot encoding by accounting for similarities across the categories. We studied similarity encoding, which is a very simple generalization of the one-hot encoding method.^{12}
An important contribution of this paper is the empirical benchmarks on dirty tables. We selected seven real-world datasets containing at least one dirty categorical variable with high cardinality (see Table 2). These datasets are openly available, and we hope that they will foster more research on dirty categorical variables. By their diversity, they enable exploring the trade-offs of encoding approaches and drawing conclusions on generally useful defaults.
Figure 5 also reveals that three of the seven datasets (medical charges, employee salaries and traffic violations) display a bimodal distribution in similarities. On these datasets, similarity encoding brings the largest gains over one-hot encoding (Fig. 2). In these situations, similarity encoding is particularly useful as it gives a vector representation in which a non-negligible number of category pairs are close to each other.
Performance comparisons with different classifiers (linear models and tree-based models in Fig. 3) suggest that 3-gram similarity reduces the gap between models by giving a better vector representation of the categories. Note that in these experiments linear models slightly outperformed tree-based models; however, we did not tune the hyperparameters of the tree learners.
While one-hot encoding can be expressed as a sparse matrix, a drawback of similarity encoding is that it creates a dense feature matrix, leading to increased memory and computational costs. Dimensionality reduction of the resulting matrix maintains most of the benefits of similarity encoding (Fig. 4) even with a strong reduction (\(d=100\)).^{13} It greatly reduces the computational cost: fitting the models on our benchmark datasets takes on the order of seconds or minutes on commodity hardware (see Table 3 in the “Appendix”). Note that on some datasets, a random projection of one-hot encoded vectors improves prediction for gradient boosting. We interpret this as a regularization that captures some semantic links across the categories, as with LSA. When more than one categorical variable is present, a related approach would be to use Correspondence Analysis (Shyu et al. 2005), which also seeks a low-rank representation as it can be interpreted as a weighted form of PCA for categorical data. Here we focus on methods that encode a single categorical variable.
The dimensionality reduction approaches that we have studied can be applied in an online learning setting: they either select a small number of prototype categories, or perform a random projection. Hence, the approach can be applied on datasets that do not fit in memory.
Classic encoding methods are hard to apply in incremental machine-learning settings. Indeed, new samples with new categories require recomputing the encoding representation, and hence retraining the model from scratch. This is not the case for similarity encoding because new categories are naturally encoded without creating collisions. We have shown the power of a straightforward strategy based on selecting 100 prototypes on subsampled data, for instance with k-means clustering. Most importantly, no data cleaning on categorical variables is required to apply our methodology. Scraped data for commercial or marketing applications are good candidates to benefit from this approach.
7 Conclusion
Similarity encoding, a generalization of the one-hot encoding method, allows a better representation of categorical variables, especially in the presence of dirty or high-cardinality categorical data. Empirical results on seven real-world datasets show that 3-gram similarity is a good choice to capture morphological resemblance between categories and to encode new categories that do not appear in the training set. It improves prediction of the associated supervised learning task without any prior data-cleaning step. Similarity encoding also outperforms representing categories via “bags of n-grams” of the associated strings. Its benefits carry over even with strong dimensionality reduction based on cheap operations such as random projections. This methodology can be used in online-learning settings, and hence can lead to tractable analysis on very large datasets without data cleaning. This paper only scratches the surface of statistical learning on non-curated tables, a topic that has not been studied much. We hope that the benchmark datasets will foster more work on this subject.
Footnotes
1. Some methods, e.g., tree-based, do not require vectorial encoding of categories (Coppersmith et al. 1999).
2.
3.
4.
5. Note that the domain of the categorical variable depends on the training set.
6. Variants of one-hot encoding include dummy coding, choosing the zero vector for a reference category, effects coding, contrast coding, and nonsense coding (Cohen et al. 2013).
7. The difference between methods is the interpretability of the values for each variable.
8. Two characters belonging to \(s_1\) and \(s_2\) are considered to be a match if they are identical and the difference in their respective positions does not exceed \(\lfloor \max (|s_1|,|s_2|)/2 \rfloor - 1\). For m=0, the Jaro distance is set to 0.
9. Precisely, the cost of PCA is \(\mathcal {O}(n\,p\,\min (n, p))\).
10. Variables’ predictive power was evaluated with the feature importances of a Random Forest as implemented in scikit-learn (Pedregosa et al. 2011). The feature importance is calculated as the average (normalized) total reduction of the Gini impurity criterion brought by each feature.
11. We used the MD5 hash function with 256 components.
12. A Python implementation is available at https://dirtycat.github.io/.
13. With Gradient Boosting, similarity encoding reduced to \(d=30\) still outperforms one-hot encoding. Indeed, tree models are good at capturing nonlinear decisions in low dimensions.
14.
15.
16.
17.
18.
19.
20.
21. Experiments are available at https://github.com/pcerda/ecmlpkdd2018.
22.
Notes
Acknowledgements
We would like to acknowledge the excellent feedback from the reviewers. This work was funded by the Wendelin and DirtyData (ANR-17-CE23-0018) grants.
References
 Alkharusi, H. (2012). Categorical variables in regression analysis: A comparison of dummy and effect coding. International Journal of Education, 4(2), 202–210.CrossRefGoogle Scholar
 Angell, R. C., Freund, G. E., & Willett, P. (1983). Automatic spelling correction using a trigram similarity measure. Information Processing & Management, 19(4), 255–261.CrossRefGoogle Scholar
 Berry, K. J., Mielke, P. W, Jr., & Iyer, H. K. (1998). Factorial designs and dummy coding. Perceptual and Motor Skills, 87(3), 919–927.CrossRefGoogle Scholar
 Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606
 Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537–1555.CrossRefGoogle Scholar
 Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2013). Applied multiple regression/correlation analysis for the behavioral sciences. London: Routledge.Google Scholar
 Cohen, W., Ravikumar, P., & Fienberg, S. (2003). A comparison of string metrics for matching names and records. Kdd Workshop on Data Cleaning and Object Consolidation, 3, 73–78.Google Scholar
 Cohen, W. W. (1998). Integration of heterogeneous databases without common domains using queries based on textual similarity. In ACM SIGMOD record (Vol. 27, pp. 201–212). ACM.
 Coppersmith, D., Hong, S. J., & Hosking, J. R. (1999). Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery, 3(2), 197–217.
 Davis, M. J. (2010). Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures. Journal of Data Science, 8(1), 61–73.
 Duch, W., Grudzinski, K., & Stawski, G. (2000). Symbolic features in neural networks. In Proceedings of the 5th conference on neural networks and their applications. Citeseer.
 Elmagarmid, A. K., Ipeirotis, P. G., & Verykios, V. S. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.
 Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
 Gomaa, W. H., & Fahmy, A. A. (2013). A survey of text similarity approaches. International Journal of Computer Applications, 68(13), 13–18.
 Grabczewski, K., & Jankowski, N. (2003). Transformations of symbolic data for continuous data oriented models. In Artificial neural networks and neural information processing (pp. 359–366). Springer.
 Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737
 Hull, D. A., et al. (1996). Stemming algorithms: A case study for detailed evaluation. JASIS, 47(1), 70–84.
 Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420.
 Kim, W., Choi, B. J., Hong, E. K., Kim, S. K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81–99.
 Kondrak, G. (2005). N-gram similarity and distance. In International symposium on string processing and information retrieval (pp. 115–126). Springer.
 Krishnan, S., Franklin, M. J., Goldberg, K., & Wu, E. (2017). BoostClean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299.
 Krishnan, S., Wang, J., Wu, E., Franklin, M. J., & Goldberg, K. (2016). ActiveClean: Interactive data cleaning for statistical modeling. Proceedings of the VLDB Endowment, 9(12), 948–959.
 Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284.
 Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707–710.
 Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2), 22–31.
 Maier, D. (1983). The theory of relational databases (Vol. 11). Rockville: Computer Science Press.
 Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1), 27–32.
 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR workshop papers.
 Myers, J. L., Well, A., & Lorch, R. F. (2010). Research design and statistical analysis. London: Routledge.
 O’Grady, K. E., & Medoff, D. R. (1988). Categorical variables in multiple regression: Some cautions. Multivariate Behavioral Research, 23(2), 243–260.
 Oliveira, P., Rodrigues, F., & Henriques, P. R. (2005). A formal definition of data quality problems. In Proceedings of the 2005 international conference on information quality (MIT IQ conference).
 Pedhazur, E. J., Kerlinger, F. N., et al. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.
 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
 Pyle, D. (1999). Data preparation for data mining (Vol. 1). Burlington: Morgan Kaufmann.
 Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems 20 (pp. 1177–1184). Curran Associates, Inc. http://papers.nips.cc/paper/3182randomfeaturesforlargescalekernelmachines.pdf.
 Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 3–13.
 Sarawagi, S., & Bhamidipaty, A. (2002). Interactive deduplication using active learning. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 269–278). ACM.
 Shyu, M. L., Sarinnapakorn, K., Kuruppu-Appuhamilage, I., Chen, S. C., Chang, L., & Goldring, T. (2005). Handling nominal features in anomaly intrusion detection problems. In 15th international workshop on research issues in data engineering: Stream data mining and applications (pp. 55–62). IEEE.
 Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of the 26th annual international conference on machine learning (pp. 1113–1120). ACM.
 Winkler, W. E. (1999). The state of record linkage and current research problems. Citeseer: Statistical Research Division, US Census Bureau.
 Winkler, W. E. (2002). Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, US Census Bureau, Washington, DC.