A clustering-based feature selection method for automatically generated relational attributes

  • Mostafa Rezaei
  • Ivor Cribben
  • Michele Samorani
S.I.: Data Mining and Analytics


Although data mining problems require a flat mining table as input, in many real-world applications analysts are interested in finding patterns in a relational database. To this end, new methods and software have recently been developed that automatically add attributes (or features) to a target table of a relational database; these attributes summarize information from all the other tables. When attributes are constructed automatically in this way, selecting the important ones is particularly difficult, because many of the attributes are highly correlated. In this setting, attribute selection techniques such as the Least Absolute Shrinkage and Selection Operator (lasso), the elastic net, and other machine learning methods tend to underperform. In this paper, we introduce a novel attribute selection procedure in which, after an initial screening step, we cluster the attributes into groups and apply the group lasso to select first the true attribute groups and then the true attributes. The procedure is particularly suited to high-dimensional data sets in which the attributes are highly correlated. We test our procedure on several simulated data sets and on a real-world data set from a marketing database. The results show that our proposed procedure achieves higher predictive performance while selecting a much smaller set of attributes than other state-of-the-art methods.
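The three stages named in the abstract (screening, clustering, group lasso) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: screening by absolute marginal correlation, a greedy correlation-threshold clustering, a plain proximal-gradient group lasso solver, the simulated data, and all function names and tuning values. The sketch stops at group-level selection and does not attempt the within-group selection step.

```python
import numpy as np

def screen(X, y, keep):
    """Assumed screening rule: keep the `keep` columns with largest |corr(x_j, y)|."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.sort(np.argsort(corr)[::-1][:keep])

def cluster_by_correlation(X, threshold):
    """Assumed clustering rule: a column joins the first cluster whose seed
    column it correlates with (in absolute value) at or above `threshold`."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    labels = -np.ones(X.shape[1], dtype=int)
    seeds = []
    for j in range(X.shape[1]):
        for g, s in enumerate(seeds):
            if C[j, s] >= threshold:
                labels[j] = g
                break
        else:  # no existing cluster is close enough: start a new one
            seeds.append(j)
            labels[j] = len(seeds) - 1
    return labels

def group_lasso(X, y, labels, lam, n_iter=1000):
    """Proximal gradient descent on
    (1 / 2n) * ||y - X b||^2 + lam * sum_g sqrt(p_g) * ||b_g||_2."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    groups = [np.flatnonzero(labels == g) for g in np.unique(labels)]
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n  # gradient step
        for idx in groups:                            # block soft-thresholding
            norm = np.linalg.norm(z[idx])
            thresh = step * lam * np.sqrt(len(idx))
            beta[idx] = 0.0 if norm <= thresh else (1 - thresh / norm) * z[idx]
    return beta

# Simulated data (hypothetical): 6 hidden factors, 5 noisy copies of each,
# so columns 0-4 and 5-9 are the two groups that actually drive y.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 6))
X = np.repeat(latent, 5, axis=1) + 0.1 * rng.normal(size=(n, 30))
y = 2.0 * latent[:, 0] - 1.5 * latent[:, 1] + 0.1 * rng.normal(size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

kept = screen(X, y, keep=20)                                # step 1: screening
labels = cluster_by_correlation(X[:, kept], threshold=0.8)  # step 2: clustering
beta = group_lasso(X[:, kept], y, labels, lam=0.5)          # step 3: group lasso
selected = kept[beta != 0]
print("selected columns:", selected)
```

Because the penalty acts on whole groups, highly correlated attributes enter or leave the model together, which is the behavior the abstract contrasts with the ordinary lasso and the elastic net.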


Relational attribute generation · Feature selection · Lasso · Elastic net · Clustering



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Operations and Information Systems, Alberta School of Business, University of Alberta, Edmonton, Canada
  2. Finance and Statistical Analysis, Alberta School of Business, University of Alberta, Edmonton, Canada
  3. Information Systems & Analytics, Leavey School of Business, Santa Clara University, Santa Clara, USA
