Greedy Selection of Attributes to Be Discretised

Baron, Grzegorz

doi:10.1007/978-3-030-02357-7_3

Chapter
First Online: 08 December 2018

Part of the book series: Studies in Computational Intelligence (SCI, volume 801)

Abstract

It is well known that discretising a dataset can, in some cases, improve the quality of a decision system. Such effects have been observed repeatedly in experiments in the stylometry domain, in particular in authorship attribution tasks. However, some experiments in which all attributes of the datasets were discretised delivered worse results than expected. This motivated testing decision systems in which only a subset of the attributes is discretised. Greedy forward and backward sequential selection methods were proposed for choosing the attributes to be discretised, and were investigated in depth. Several supervised and unsupervised discretisation methods were employed, and the Naive Bayes classifier served as the inducer in the decision system. The relation between the successive subsets of discretised attributes and the performance of the decision system was observed. The research showed that the measure of system quality reaches a maximum over the series of attribute subsets generated during the sequential selection processes. Attempts to find an optimal subset of attributes to discretise are therefore reasonable.
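The greedy forward and backward sequential selection described in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function names, the `evaluate` callback, and the `toy_score` values are all hypothetical. In a real experiment `evaluate` would presumably train and test a Naive Bayes classifier on data in which only the given attributes are discretised; here it is replaced by a made-up additive score so the sketch runs on its own.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# One step of the search: (attributes currently discretised, quality score).
Step = Tuple[FrozenSet[int], float]
Evaluate = Callable[[FrozenSet[int]], float]


def forward_selection(attributes: Sequence[int], evaluate: Evaluate) -> List[Step]:
    """Start with nothing discretised; greedily add one attribute per step."""
    chosen: FrozenSet[int] = frozenset()
    trajectory = [(chosen, evaluate(chosen))]
    remaining = set(attributes)
    while remaining:
        # Greedy step: pick the attribute whose discretisation scores best.
        best = max(sorted(remaining), key=lambda a: evaluate(chosen | {a}))
        chosen = chosen | {best}
        remaining.remove(best)
        trajectory.append((chosen, evaluate(chosen)))
    return trajectory


def backward_selection(attributes: Sequence[int], evaluate: Evaluate) -> List[Step]:
    """Start with everything discretised; greedily drop one attribute per step."""
    chosen: FrozenSet[int] = frozenset(attributes)
    trajectory = [(chosen, evaluate(chosen))]
    while chosen:
        # Greedy step: drop the attribute whose removal scores best.
        drop = max(sorted(chosen), key=lambda a: evaluate(chosen - {a}))
        chosen = chosen - {drop}
        trajectory.append((chosen, evaluate(chosen)))
    return trajectory


def best_subset(trajectory: List[Step]) -> Step:
    """Return the step with the highest score in the series of subsets."""
    return max(trajectory, key=lambda step: step[1])


def toy_score(discretised: FrozenSet[int]) -> float:
    """Hypothetical stand-in for classifier quality (values are invented)."""
    gains = {0: 0.05, 1: 0.03, 2: -0.02, 3: -0.04}
    return 0.80 + sum(gains[a] for a in discretised)


forward = forward_selection(range(4), toy_score)
backward = backward_selection(range(4), toy_score)
print(sorted(best_subset(forward)[0]), sorted(best_subset(backward)[0]))
```

With these invented scores both directions peak at the subset {0, 1}, which is strictly better than discretising no attributes or all of them. That mirrors the kind of maximum over the series of subsets that the abstract reports, although real classifier scores need not be additive and the two directions need not agree.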

References

1. Chen, M.: A greedy algorithm with forward-looking strategy. In: Bednorz, W. (ed.) Greedy Algorithms. InTech (2008)
2. Dechter, A., Dechter, R.: On the greedy solution of ordering problems. ORSA J. Comput. 1(3), 181–189 (1989)
3. Bang-Jensen, J., Gutin, G., Yeo, A.: When the greedy algorithm fails. Discrete Optim. 1, 121–127 (2004)
4. Caruana, R., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann (1994)
5. Stańczyk, U.: Weighting of features by sequential selection. In: Stańczyk, U., Jain, L.C. (eds.) Feature Selection for Data and Pattern Recognition, pp. 71–90. Springer, Berlin, Heidelberg (2015)
6. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
7. Baron, G.: Influence of data discretization on efficiency of Bayesian Classifier for authorship attribution. Procedia Comput. Sci. 35, 1112–1121 (2014)
8. Baron, G.: On sequential selection of attributes to be discretized for authorship attribution. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 229–234 (July 2017)
9. Juola, P.: Authorship attribution. Found. Trends Inf. Retr. 1(3), 233–334 (2006)
10. Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2017)
11. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
12. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
13. Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003)
14. Zhao, Y., Zobel, J.: Searching with style: authorship attribution in classic literature. In: Proceedings of the Thirtieth Australasian Conference on Computer Science, Volume 62, ACSC ’07, pp. 59–68. Australian Computer Society, Inc., Darlinghurst, Australia (2007)
15. Zhao, Y., Zobel, J.: Effective and scalable authorship attribution using function words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds.) Information Retrieval Technology, pp. 174–189. Springer, Berlin, Heidelberg (2005)
16. Baayen, H., van Halteren, H., Tweedie, F.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit. Linguist. Comput. 11(3), 121–132 (1996)
17. Dash, R., Paramguru, R.L., Dash, R.: Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2(3), 29–37 (2011)
18. Yang, Y., Webb, G.I., Wu, X.: Discretization Methods, pp. 113–130. Springer, Boston, MA, US (2005)
19. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th International Conference on Machine Learning, pp. 194–202. Morgan Kaufmann (1995)
20. García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)
21. Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’00), pp. 315–319. ACM, New York, NY, USA (2000)
22. Bakar, A.A., Othman, Z.A., Shuib, N.L.M.: Building a new taxonomy for data discretization techniques. In: 2009 2nd Conference on Data Mining and Optimization, pp. 132–140 (Oct 2009)
23. Peng, L., Qing, W., Yujia, G.: Study on comparison of discretization methods. In: 2009 International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, pp. 380–384 (Nov 2009)
24. Kotsiantis, S., Kanellopoulos, D.: Discretization techniques: a recent survey. Int. Trans. Comput. Sci. Eng. 1(32), 47–58 (2006)
25. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: 13th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1022–1027. Morgan Kaufmann Publishers (1993)
26. Kononenko, I.: On biases in estimating multi-valued attributes. In: 14th International Joint Conference on Artificial Intelligence, pp. 1034–1040 (1995)
27. Baron, G.: Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Computer and Information Sciences: 31st International Symposium, ISCIS 2016, Kraków, Poland, October 27–28, 2016, Proceedings, pp. 81–89. Springer International Publishing, Cham (2016)
28. Baron, G., Harężlak, K.: On approaches to discretization of datasets used for evaluation of decision systems. In: Czarnowski, I., Caballero, M.A., Howlett, J.R., Jain, C.L. (eds.) Intelligent Decision Technologies 2016: Proceedings of the 8th KES International Conference on Intelligent Decision Technologies (KES-IDT 2016), Part II, pp. 149–159. Springer International Publishing, Cham (2016)
29. Zhang, H.: The Optimality of Naive Bayes. In: Barr, V., Markov, Z. (eds.) FLAIRS Conference. AAAI Press (2004)
30. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop On Learning For Text Categorization, pp. 41–48. AAAI Press (1998)
31. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2), 103–130 (1997)
32. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
33. John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann (1995)
34. Sardinha, B.: Using Key Words in Text Analysis: practical aspects. http://www2.lael.pucsp.br/direct/DirectPapers42.pdf (1999). Accessed 4 Jan 2018
35. Peng, R.D., Hengartner, N.W.: Quantitative analysis of literary styles. Am. Stat. 56(3), 175–185 (2002)
36. Argamon, S., Karlgren, J., Shanahan, J.G.: Stylistic analysis of text for information access. In: 28th Annual International ACM Conference on Research and Development in Information Retrieval. Brazil (2005)
37. Stańczyk, U.: Decision rule length as a basis for evaluation of attribute relevance. J. Intell. Fuzzy Syst. 24(3), 429–445 (2013)
38. Stańczyk, U.: The class imbalance problem in construction of training datasets for authorship attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4: 4th International Conference on Man-Machine Interactions, ICMMI 2015 Kocierz Pass, Poland, October 6–9, 2015, pp. 535–547. Springer International Publishing, Cham (2016)

Acknowledgements

The research described was performed using the WEKA workbench [6] at the Silesian University of Technology, Gliwice, Poland, within the framework of the project BK/RAu2/2018.

Author information

Authors and Affiliations

Institute of Computer Science, Silesian University of Technology, Akademicka 16/317, 44-100, Gliwice, Poland
Grzegorz Baron

Corresponding author

Correspondence to Grzegorz Baron.

Editor information

Editors and Affiliations

Faculty of Computers and Information, Cairo University, Giza, Egypt
Aboul Ella Hassanien

About this chapter

Cite this chapter

Baron, G. (2019). Greedy Selection of Attributes to Be Discretised. In: Hassanien, A. (eds) Machine Learning Paradigms: Theory and Application. Studies in Computational Intelligence, vol 801. Springer, Cham. https://doi.org/10.1007/978-3-030-02357-7_3

DOI: https://doi.org/10.1007/978-3-030-02357-7_3
Published: 08 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02356-0
Online ISBN: 978-3-030-02357-7
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
