Greedy Selection of Attributes to Be Discretised

Part of the book series: Studies in Computational Intelligence (SCI, volume 801)

Abstract

Discretising a dataset can, in some cases, improve the quality of a decision system. Such improvements were observed repeatedly in stylometric experiments on authorship attribution. However, some experiments delivered worse results than expected when all attributes in the datasets were discretised. This motivated testing decision systems in which only a subset of the attributes is discretised. To select the attributes to be discretised, greedy forward and backward sequential selection methods were proposed and investigated in depth. Several supervised and unsupervised discretisation methods were employed, and the Naive Bayes classifier served as the inducer in the decision system. The relation between the successive subsets of discretised attributes and the performance of the decision system was observed. The research showed that the measure of system quality attains a maximum over the series of attribute subsets generated during the sequential selection processes. Searching for the optimal subset of attributes to discretise is therefore justified.
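
The greedy forward and backward procedures named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation (the acknowledgements state that the experiments used the WEKA workbench). The names `greedy_sequence`, `best_subset`, `toy_score` and the attribute names are hypothetical, and `toy_score` merely stands in for the quality measure of a Naive Bayes classifier evaluated with the given attributes discretised.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# A scoring function maps "the set of attributes that are discretised"
# to a quality measure of the decision system (higher is better).
Score = Callable[[FrozenSet[str]], float]
History = List[Tuple[FrozenSet[str], float]]


def greedy_sequence(attributes: Iterable[str], score: Score,
                    forward: bool = True) -> History:
    """Return the series of (discretised subset, score) pairs visited.

    Forward: start with no attribute discretised and, at each step,
    discretise the one attribute that gives the best score.
    Backward: start with all attributes discretised and, at each step,
    revert the one attribute whose reversion gives the best score.
    """
    all_attrs = frozenset(attributes)
    current = frozenset() if forward else all_attrs
    remaining = set(all_attrs)  # attributes not yet moved
    history: History = [(current, score(current))]

    while remaining:
        def step(attr: str) -> FrozenSet[str]:
            # Subset obtained by moving a single attribute.
            return current | {attr} if forward else current - {attr}

        # Greedy choice; sorting makes tie-breaking deterministic.
        best_attr = max(sorted(remaining), key=lambda a: score(step(a)))
        current = step(best_attr)
        remaining.remove(best_attr)
        history.append((current, score(current)))
    return history


def best_subset(history: History) -> Tuple[FrozenSet[str], float]:
    """Pick the subset with the maximum score along the series."""
    return max(history, key=lambda pair: pair[1])


# Hypothetical stand-in for classifier quality, in percentage points:
# discretising some attributes helps, discretising others hurts.
ATTRS = ["the", "and", "of", "comma"]
GAINS = {"the": 5, "and": 3, "of": -2, "comma": -4}


def toy_score(discretised: FrozenSet[str]) -> int:
    return 80 + sum(GAINS[a] for a in discretised)


if __name__ == "__main__":
    for direction in (True, False):
        series = greedy_sequence(ATTRS, toy_score, forward=direction)
        subset, quality = best_subset(series)
        print("forward" if direction else "backward",
              [s for _, s in series], sorted(subset), quality)
```

On this toy measure both directions peak at the same subset partway through the series. With a real classifier the forward and backward series need not agree, so each has to be evaluated separately.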

References

  1. Chen, M.: A greedy algorithm with forward-looking strategy. In: Bednorz, W. (ed.) Greedy Algorithms. InTech (2008)

  2. Dechter, A., Dechter, R.: On the greedy solution of ordering problems. ORSA J. Comput. 1(3), 181–189 (1989)

  3. Bang-Jensen, J., Gutin, G., Yeo, A.: When the greedy algorithm fails. Discrete Optim. 1, 121–127 (2004)

  4. Caruana, R., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann (1994)

  5. Stańczyk, U.: Weighting of features by sequential selection. In: Stańczyk, U., Jain, L.C. (eds.) Feature Selection for Data and Pattern Recognition, pp. 71–90. Springer, Berlin, Heidelberg (2015)

  6. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)

  7. Baron, G.: Influence of data discretization on efficiency of Bayesian Classifier for authorship attribution. Procedia Comput. Sci. 35, 1112–1121 (2014)

  8. Baron, G.: On sequential selection of attributes to be discretized for authorship attribution. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 229–234. July 2017

  9. Juola, P.: Authorship attribution. Found. Trends Inf. Retr. 1(3), 233–334 (2006)

  10. Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2017)

  11. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)

  12. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)

  13. Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003)

  14. Zhao, Y., Zobel, J.: Searching with style: authorship attribution in classic literature. In: Proceedings of the Thirtieth Australasian Conference on Computer Science, vol. 62, ser. ACSC ’07, pp. 59–68. Australian Computer Society, Inc., Darlinghurst, Australia (2007)

  15. Zhao, Y., Zobel, J.: Effective and scalable authorship attribution using function words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds.) Information Retrieval Technology, pp. 174–189. Springer, Berlin, Heidelberg (2005)

  16. Baayen, H., van Halteren, H., Tweedie, F.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit. Linguist. Comput. 11(3), 121–132 (1996)

  17. Dash, R., Paramguru, R.L., Dash, R.: Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2(3), 29–37 (2011)

  18. Yang, Y., Webb, G.I., Wu, X.: Discretization Methods, pp. 113–130. Springer, Boston, MA, US (2005)

  19. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th International Conference on Machine Learning, pp. 194–202. Morgan Kaufmann (1995)

  20. García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)

  21. Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’00, pp. 315–319. ACM, New York, NY, USA (2000)

  22. Bakar, A.A., Othman, Z.A., Shuib, N.L.M.: Building a new taxonomy for data discretization techniques. In: 2009 2nd Conference on Data Mining and Optimization, pp. 132–140. Oct 2009

  23. Peng, L., Qing, W., Yujia, G.: Study on comparison of discretization methods. In: 2009 International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, pp. 380–384. Nov 2009

  24. Kotsiantis, S., Kanellopoulos, D.: Discretization techniques: a recent survey. Int. Trans. Comput. Sci. Eng. 1(32), 47–58 (2006)

  25. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: 13th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1022–1027. Morgan Kaufmann Publishers (1993)

  26. Kononenko, I.: On biases in estimating multi-valued attributes. In: 14th International Joint Conference on Artificial Intelligence, pp. 1034–1040 (1995)

  27. Baron, G.: Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Computer and Information Sciences: 31st International Symposium, ISCIS 2016, Kraków, Poland, October 27–28, 2016, Proceedings, pp. 81–89. Springer International Publishing, Cham (2016)

  28. Baron, G., Harężlak, K.: On approaches to discretization of datasets used for evaluation of decision systems. In: Czarnowski, I., Caballero, A.M., Howlett, R.J., Jain, L.C. (eds.) Intelligent Decision Technologies 2016: Proceedings of the 8th KES International Conference on Intelligent Decision Technologies (KES-IDT 2016), Part II, pp. 149–159. Springer International Publishing, Cham (2016)

  29. Zhang, H.: The Optimality of Naive Bayes. In: Barr, V., Markov, Z. (eds.) FLAIRS Conference. AAAI Press (2004)

  30. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop On Learning For Text Categorization, pp. 41–48. AAAI Press (1998)

  31. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2), 103–130 (1997)

  32. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)

  33. John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann (1995)

  34. Sardinha, B.: Using Key Words in Text Analysis: practical aspects. http://www2.lael.pucsp.br/direct/DirectPapers42.pdf (1999). Accessed 4 Jan 2018

  35. Peng, R.D., Hengartner, N.W.: Quantitative analysis of literary styles. Am. Stat. 56(3), 175–185 (2002)

  36. Argamon, S., Karlgren, J., Shanahan, J.G.: Stylistic analysis of text for information access. In: 28th Annual International ACM Conference on Research and Development in Information Retrieval. Brazil (2005)

  37. Stańczyk, U.: Decision rule length as a basis for evaluation of attribute relevance. J. Intell. Fuzzy Syst. 24(3), 429–445 (2013)

  38. Stańczyk, U.: The class imbalance problem in construction of training datasets for authorship attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4: 4th International Conference on Man-Machine Interactions, ICMMI 2015 Kocierz Pass, Poland, October 6–9, 2015, pp. 535–547. Springer International Publishing, Cham (2016)

Acknowledgements

The research described was performed using the WEKA workbench [6] at the Silesian University of Technology, Gliwice, Poland, within the framework of the project BK/RAu2/2018.

Author information

Corresponding author

Correspondence to Grzegorz Baron.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Baron, G. (2019). Greedy Selection of Attributes to Be Discretised. In: Hassanien, A. (eds) Machine Learning Paradigms: Theory and Application. Studies in Computational Intelligence, vol 801. Springer, Cham. https://doi.org/10.1007/978-3-030-02357-7_3
