Greedy Selection of Attributes to Be Discretised

  • Grzegorz BaronEmail author
Part of the Studies in Computational Intelligence book series (SCI, volume 801)


It is well known that discretisation of datasets in some cases may improve the quality of a decision system. Such effects were observed many times during experiments conducted in stylometry domain when authorship attribution tasks were performed. However, some experiments delivered results worse than expected when all attributes in datasets were discretised. Therefore, the idea to test decision systems where only part of attributes is discretised arose. For the selection of attributes to be discretised the greedy forward and backward sequential selection methods were proposed and deeply investigated. Different supervised and unsupervised discretisation methods were employed. The Naive Bayes classifier was selected as the inducer in the decision system. The relation between the subsequent subsets of attributes being discretised and the performance of the decision system was observed. The research proved that there is the maximum of the measure of system quality in respect to the series of subsets of attributes being discretised, generated during the sequential selection processes. Therefore, the attempts to find the optimal subsets of attributes to be discretised are reasonable.



The research described was performed using WEKA workbench [6] at the Silesian University of Technology, Gliwice, Poland, in the framework of the project BK/RAu2/2018.


  1. 1.
    Chen, M.: A greedy algorithm with forward-looking strategy. In: Bednorz, W. (eds.) Greedy Algorithms, InTech (2008)Google Scholar
  2. 2.
    Dechter, A., Dechter, R.: On the greedy solution of ordering problems. ORSA J. Comput. 1(3), 181–189 (1989)CrossRefGoogle Scholar
  3. 3.
    Bang-Jensen, J., Gutin, G., Yeo, A.: When the greedy algorithm fails. Discrete Optim. 1, 121–127 (2004)MathSciNetCrossRefGoogle Scholar
  4. 4.
    Caruana, R., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann (1994)Google Scholar
  5. 5.
    Stańczyk, U.: Weighting of features by sequential selection. In: Stańczyk, U., Jain, L.C. (eds) Feature Selection for Data and Pattern Recognition, pp. 71–90. Springer, Berlin, Heidelberg (2015)Google Scholar
  6. 6.
    Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)CrossRefGoogle Scholar
  7. 7.
    Baron, G.: Influence of data discretization on efficiency of Bayesian Classifier for authorship attribution. Procedia Comput. Sci. 35, 1112–1121 (2014)CrossRefGoogle Scholar
  8. 8.
    Baron, G.: On sequential selection of attributes to be discretized for authorship attribution. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 229–234. July 2017Google Scholar
  9. 9.
    Juola, P.: Authorship attribution. Found. Trends Inf. Retr. 1(3), 233–334 (2006)CrossRefGoogle Scholar
  10. 10.
    Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2017)CrossRefGoogle Scholar
  11. 11.
    Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)CrossRefGoogle Scholar
  12. 12.
    Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)CrossRefGoogle Scholar
  13. 13.
    Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003)Google Scholar
  14. 14.
    Zhao, Y., Zobel, J.: Searching with style: authorship attribution in classic literature. In: Proceedings of the Thirtieth Australasian Conference on Computer Science—Volume 62, ser, ACSC ’07 pp. 59–68. Australian Computer Society, Inc., Darlinghurst, Australia (2007)Google Scholar
  15. 15.
    Zhao, Y., Zobel, J.: Effective and scalable authorship attribution using function words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds.) Information Retrieval Technology, pp. 174–189. Springer, Berlin, Heidelberg (2005)Google Scholar
  16. 16.
    Baayen, H., van Halteren, H., Tweedie, F.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit. Linguist. Comput. 11(3), 121–132 (1996)CrossRefGoogle Scholar
  17. 17.
    Dash, R., Paramguru, R.L., Dash, R.: Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2(3), 29–37 (2011)Google Scholar
  18. 18.
    Yang, Y., Webb, G.I., Wu, X.: Discretization Methods, pp. 113–130. Springer, Boston, MA, US (2005)Google Scholar
  19. 19.
    Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th International Conference on Machine Learning, pp. 194–202. Morgan Kaufmann (1995)Google Scholar
  20. 20.
    García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)CrossRefGoogle Scholar
  21. 21.
    Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: 2000 Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser (KDD’00), pp. 315–319. ACM New York, NY, USA (2000)Google Scholar
  22. 22.
    Bakar, A.A., Othman, Z.A., Shuib, N.L.M.: Building a new taxonomy for data discretization techniques. In: 2009 2nd Conference on Data Mining and Optimization, pp. 132–140. Oct 2009Google Scholar
  23. 23.
    Peng, L., Qing, W., Yujia, G.: Study on comparison of discretization methods. In: 2009 International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, pp. 380–384. Nov 2009Google Scholar
  24. 24.
    Kotsiantis, S., Kanellopoulos, D.: Discretization techniques: a recent survey. Int. Trans. Comput. Sci. Eng. 1(32), 47–58 (2006)Google Scholar
  25. 25.
    Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuousvalued attributes for classification learning. In: 13th International Joint Conference on Articial Intelligence, vol. 2, pp. 1022–1027. Morgan Kaufmann Publishers (1993)Google Scholar
  26. 26.
    Kononenko, I.: On biases in estimating multi-valued attributes. In: 14th International Joint Conference on Articial Intelligence, pp. 1034–1040 (1995)Google Scholar
  27. 27.
    Baron, G.: Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Computer and Information Sciences: 31st International Symposium, ISCIS 2016, Kraków, Poland, October 27–28, 2016, Proceedings, pp. 81–89. Springer International Publishing, Cham (2016)CrossRefGoogle Scholar
  28. 28.
    Baron, G., Harężlak, K.: On approaches to discretization of datasets used for evaluation of decision systems. In: Czarnowski, I., Caballero, M.A., Howlett, J.R., Jain, C.L., (eds.) Intelligent Decision Technologies 2016: Proceedings of the 8th KES International Conference on Intelligent Decision Technologies (KES-IDT 2016)—Part II, pp. 149–159. Springer International Publishing, Cham (2016)Google Scholar
  29. 29.
    Zhang, H.: The Optimality of Naive Bayes. In: Barr, V., Markov, Z. (eds.) FLAIRS Conference. AAAI Press (2004)Google Scholar
  30. 30.
    McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop On Learning For Text Categorization, pp. 41–48. AAAI Press (1998)Google Scholar
  31. 31.
    Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2), 103–130 (1997)CrossRefGoogle Scholar
  32. 32.
    Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)MathSciNetCrossRefGoogle Scholar
  33. 33.
    John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann (1995)Google Scholar
  34. 34.
    Sardinha, B.: Using Key Words in Text Analysis: practical aspects. (1999). Accessed 4 Jan 2018
  35. 35.
    Peng, R.D., Hengartner, N.W.: Quantitative analysis of literary styles. Am. Stat. 56(3), 175–185 (2002)MathSciNetCrossRefGoogle Scholar
  36. 36.
    Argamon, S., Karlgren, J., Shanahan, J.G.: Stylistic analysis of text for information access. In: 28th Annual International ACM Conference on Research and Development in Information Retrieval. Brazil (2005)Google Scholar
  37. 37.
    Stańczyk, U.: Decision rule length as a basis for evaluation of attribute relevance. J. Intel. Fuzzy Syst. 24(3), 429–445 (2013)Google Scholar
  38. 38.
    Stańczyk, U.: The class imbalance problem in construction of training datasets for authorship attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4: 4th International Conference on Man-Machine Interactions, ICMMI 2015 Kocierz Pass, Poland, October 6–9, 2015, pp. 535–547. Springer International Publishing, Cham (2016)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.Institute of Computer Science, Silesian University of TechnologyGliwicePoland

Personalised recommendations