Abstract
Although data mining methods require a flat mining table as input, in many real-world applications analysts are interested in finding patterns in a relational database. To this end, new methods and software have recently been developed that automatically add attributes (or features) to a target table of a relational database, summarizing information from all other tables. When attributes are constructed automatically by these methods, selecting the important attributes is particularly difficult, because a large number of the attributes are highly correlated. In this setting, attribute selection techniques such as the Least Absolute Shrinkage and Selection Operator (lasso), the elastic net, and other machine learning methods tend to under-perform. In this paper, we introduce a novel attribute selection procedure: after an initial screening step, we cluster the attributes into groups and apply the group lasso to select first the true attribute groups and then the true attributes within those groups. The procedure is particularly suited to high-dimensional data sets where the attributes are highly correlated. We test our procedure on several simulated data sets and a real-world data set from a marketing database. The results show that our proposed procedure obtains higher predictive performance while selecting a much smaller set of attributes than other state-of-the-art methods.
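The three stages described above (marginal screening, clustering of correlated attributes, and grouped selection) can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the data are synthetic, the thresholds are arbitrary, and the final stage uses a plain lasso on per-cluster mean attributes as a simplified stand-in for the group lasso.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data: 3 latent factors, each replicated into 10 highly
# correlated attributes (30 attributes total); y depends on two factors.
n, g, k = 200, 3, 10
latent = rng.normal(size=(n, g))
X = np.repeat(latent, k, axis=1) + 0.1 * rng.normal(size=(n, g * k))
y = latent[:, 0] - 2 * latent[:, 1] + 0.1 * rng.normal(size=n)

# 1) Screening: drop attributes with negligible marginal correlation to y.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = corr > 0.05
Xs = X[:, keep]

# 2) Cluster the surviving attributes using correlation distance.
d = 1 - np.abs(np.corrcoef(Xs, rowvar=False))
condensed = d[np.triu_indices_from(d, 1)]  # condensed distance vector
groups = fcluster(linkage(condensed, method="average"),
                  t=0.5, criterion="distance")

# 3) Stand-in for the group lasso: fit a lasso on per-cluster mean
#    attributes, then keep every attribute in a selected cluster.
reps = np.column_stack([Xs[:, groups == c].mean(axis=1)
                        for c in np.unique(groups)])
fit = LassoCV(cv=5).fit(reps, y)
selected_clusters = np.unique(groups)[fit.coef_ != 0]
```

In this toy setting the replicated attributes within a factor are nearly perfectly correlated, so hierarchical clustering recovers the factor structure and the selection step operates on a handful of groups rather than on 30 collinear columns.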
References
Anderson, E. T., Hansen, K., & Simester, D. (2009). The option value of returns: Theory and empirical evidence. Marketing Science, 28(3), 405–423.
Batini, C., Ceri, S., & Navathe, S. (1989). Entity relationship approach. North Holland: Elsevier Science Publishers BV.
Bondell, H. D., & Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with oscar. Biometrics, 64(1), 115–123.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Bühlmann, P., Rütimann, P., van de Geer, S., & Zhang, C. (2013). Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference, 143(11), 1835–1858.
Dettling, M., & Bühlmann, P. (2004). Finding predictive gene groups from microarray data. Journal of Multivariate Analysis, 90(1), 106–131.
Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871–1874.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and techniques. Amsterdam: Elsevier.
Hastie, T., Tibshirani, R., Botstein, D., & Brown, P. (2001). Supervised harvesting of expression trees. Genome Biology, 2(1), research0003.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
Hess, J. D., Chu, W., & Gerstner, E. (1996). Controlling product returns in direct marketing. Marketing Letters, 7(4), 307–317.
Hess, J. D., & Mayhew, G. E. (1997). Modeling merchandise returns in direct marketing. Journal of Interactive Marketing, 11(2), 20–35.
Huang, J., Ma, S., Li, H., & Zhang, C. H. (2011). The sparse laplacian shrinkage estimator for high-dimensional regression. Annals of Statistics, 39(4), 2021.
Hwang, K., Kim, D., Lee, K., Lee, C., & Park, S. (2017). Embedded variable selection method using signomial classification. Annals of Operations Research, 254(1–2), 89–109.
Janakiraman, N., & Ordóñez, L. (2012). Effect of effort and deadlines on consumer product returns. Journal of Consumer Psychology, 22(2), 260–271.
Kendall, M. (1957). A course in multivariate analysis. London: Griffin.
Knobbe, A. J., De Haas, M., & Siebes, A. (2001). Propositionalisation and aggregates. In L. De Raedt & A. Siebes (Eds.), Principles of data mining and knowledge discovery (pp. 277–288). Berlin: Springer.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Berlin: Springer.
Mollenkopf, D. A., Frankel, R., & Russo, I. (2011). Creating value through returns management: Exploring the marketing-operations interface. Journal of Operations Management, 29(5), 391–403.
Ni, J., Neslin, S., & Sun, B. (2012). Database submission—The ISMS durable goods data sets. Marketing Science, 31(6), 1008–1013.
Perlich, C., & Provost, F. (2006). Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1–2), 65–105.
Petersen, J. A., & Kumar, V. (2009). Are product returns a necessary evil? Antecedents and consequences. Journal of Marketing, 73(3), 35–51.
Petersen, J. A., & Kumar, V. (2015). Perceived risk, product returns, and optimal resource allocation: Evidence from a field experiment. Journal of Marketing Research, 52(2), 268–285.
Popescul, A., & Ungar, L. H. (2003). Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data (Vol. 2003).
Reynolds, A., Richards, G., de la Iglesia, B., & Rayward-Smith, V. (2006). Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5(4), 475–504.
Samorani, M. (2015). Automatically generate a flat mining table with Dataconda. In 2015 IEEE international conference on data mining workshop (ICDMW), IEEE (pp. 1644–1647).
Samorani, M., Ahmed, F., & Zaiane, O. R. (2016). Automatic generation of relational attributes: An application to product returns. In 2016 IEEE international conference on big data (Big Data) (pp. 1454–1463). https://doi.org/10.1109/BigData.2016.7840753.
Samorani, M., Laguna, M., DeLisle, R. K., & Weaver, D. C. (2011). A randomized exhaustive propositionalization approach for molecule classification. INFORMS Journal on Computing, 23(3), 331–345.
She, Y. (2008). Sparse regression with exact clustering. Ann Arbor: ProQuest.
Shih, D. T., Kim, S. B., Chen, V. C., Rosenberger, J. M., & Pilla, V. L. (2014). Efficient computer experiment-based optimization through variable selection. Annals of Operations Research, 216(1), 287–305.
Simon, H. A. (1979). Rational decision making in business organizations. The American Economic Review, 69, 493–513.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Yuan, M., & Lin, Y. (2007). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Appendices
Appendix A: Attribute generation with Dataconda
Dataconda is a software tool, freely available online, that generates attributes in a relational database. The user chooses a target table, to which Dataconda automatically adds new attributes that summarize information contained in the rest of the database. In the example of Fig. 1, the target table is Purchases.
Here, we briefly illustrate how Dataconda generates attributes; the details are reported in Samorani et al. (2016). An attribute is built in two steps. In the first step, the procedure generates a large number of paths that start from the target table (and end anywhere).
In the second step, many attributes are generated for each path. The procedure generates attributes by virtually adding attributes to each table of the path, starting from the second to last table of the path and finishing with the first table of the path; each virtual attribute has the purpose of summarizing the information contained in the tables that follow along the path. When the procedure reaches the target table, the algorithm has constructed an attribute which summarizes the information contained in the path.
Suppose that the path built in the first step is Purchases P1 \(\rightarrow \) Clients C \(\rightarrow \) Purchases P2. The second step starts by virtually adding to table C an attribute that summarizes P2. Since the relationship linking C to P2 is 1-to-n (i.e., one client may have many purchases), the attribute virtually added to C is obtained by summarizing the purchases made by each client. Examples of virtual attributes that can be added to C include the "number of purchases" made by each client, "the average value of the attribute Return among the purchases of each client", or the "number of purchases where \(\hbox {Online}=1\) made by each client". Clearly, there are many choices in building attributes, both in terms of the aggregate function to use (average, sum, count, etc.) and in terms of the "where clauses" to use (where Online = 1, where Online = 0, where Return = 1, etc.). Dataconda allows the user to decide which aggregate functions and "where clauses" to adopt for each attribute.

After virtually adding an attribute to table C, the algorithm proceeds by adding to table P1, the target table, an attribute from table C that summarizes the rest of the path. Since the relationship linking P1 to C is 1-to-1 (i.e., one purchase has exactly one client), this attribute could simply be the virtual attribute added to table C. In this way, we could add to the target table attributes such as the "number of purchases made by the same client prior to the current purchase", "the average value of the attribute Return among the purchases of the same client made prior to the current purchase", or the "number of purchases where \(\hbox {Online}=1\) made by the same client prior to the current purchase".
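The path-based aggregation described above can be sketched as follows. This is a simplified illustration of the idea, not Dataconda's implementation: the toy Purchases table and its column names are hypothetical, and the path Purchases P1 \(\rightarrow \) Clients C \(\rightarrow \) Purchases P2 is collapsed into a single pass that, for each target purchase, aggregates the prior purchases of the same client subject to an optional "where clause".

```python
from collections import defaultdict

# Hypothetical toy Purchases table mirroring the example's schema:
# a purchase id, the client, an Online flag, a Return flag, and a time index.
purchases = [
    {"pid": 1, "client": "A", "online": 1, "ret": 0, "t": 1},
    {"pid": 2, "client": "A", "online": 0, "ret": 1, "t": 2},
    {"pid": 3, "client": "B", "online": 1, "ret": 0, "t": 1},
    {"pid": 4, "client": "A", "online": 1, "ret": 1, "t": 3},
]

def add_path_attribute(target, agg, where=lambda r: True):
    """For each row of the target table, summarize the prior purchases of
    the same client (the path Purchases -> Clients -> Purchases) that
    satisfy the where-clause, using the given aggregate function."""
    by_client = defaultdict(list)
    for r in target:
        by_client[r["client"]].append(r)
    out = []
    for r in target:
        prior = [p for p in by_client[r["client"]]
                 if p["t"] < r["t"] and where(p)]
        out.append(agg(prior))
    return out

# "number of purchases made by the same client prior to the current purchase"
n_prior = add_path_attribute(purchases, agg=len)          # -> [0, 1, 0, 2]

# "number of prior purchases of the same client where Online = 1"
n_prior_online = add_path_attribute(purchases, agg=len,
                                    where=lambda p: p["online"] == 1)
```

Swapping `agg` (count, sum, average) and `where` (Online = 1, Return = 1, etc.) generates the combinatorial variety of attributes described above, which is precisely why automatic generation produces many highly correlated candidates.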
Appendix B: Results for the six combinations with Shrinkage Method = elastic net on the simulated data set
Cite this article
Rezaei, M., Cribben, I. & Samorani, M. A clustering-based feature selection method for automatically generated relational attributes. Ann Oper Res 303, 233–263 (2021). https://doi.org/10.1007/s10479-018-2830-2