Abstract
It is well known that discretisation of datasets may, in some cases, improve the quality of a decision system. Such effects have been observed many times in experiments in the stylometry domain, where authorship attribution tasks were performed. However, some experiments delivered worse results than expected when all attributes in the datasets were discretised. This motivated the idea of testing decision systems in which only some of the attributes are discretised. To select the attributes to be discretised, greedy forward and backward sequential selection methods were proposed and investigated in depth. Various supervised and unsupervised discretisation methods were employed, and the Naive Bayes classifier was chosen as the inducer in the decision system. The relation between the successive subsets of discretised attributes and the performance of the decision system was observed. The research showed that the measure of system quality reaches a maximum over the series of attribute subsets generated during the sequential selection processes. Attempts to find the optimal subset of attributes to be discretised are therefore justified.
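The greedy forward and backward sequential selection described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the attribute names and the `toy_score` function are hypothetical. In a real experiment the scoring function would discretise the given attributes, train a Naive Bayes classifier and return its measured quality.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[str]
History = List[Tuple[Subset, float]]


def forward_selection(attrs: Sequence[str],
                      evaluate: Callable[[Subset], float]) -> History:
    """Start with no attribute discretised; at each step discretise the
    single attribute that gives the best score."""
    chosen: Subset = frozenset()
    history: History = [(chosen, evaluate(chosen))]
    remaining = list(attrs)
    while remaining:
        # Score every one-attribute extension of the current subset.
        scored = [(evaluate(chosen | {a}), a) for a in remaining]
        best_score, best_attr = max(scored, key=lambda t: t[0])
        chosen = chosen | {best_attr}
        remaining.remove(best_attr)
        history.append((chosen, best_score))
    return history


def backward_selection(attrs: Sequence[str],
                       evaluate: Callable[[Subset], float]) -> History:
    """Start with all attributes discretised; at each step stop discretising
    the single attribute whose removal gives the best score."""
    chosen: Subset = frozenset(attrs)
    history: History = [(chosen, evaluate(chosen))]
    while chosen:
        # Score every one-attribute reduction of the current subset.
        scored = [(evaluate(chosen - {a}), a) for a in sorted(chosen)]
        best_score, best_attr = max(scored, key=lambda t: t[0])
        chosen = chosen - {best_attr}
        history.append((chosen, best_score))
    return history


def best_step(history: History) -> Tuple[Subset, float]:
    """Return the subset with the highest score seen during the run."""
    return max(history, key=lambda t: t[1])


# Hypothetical stand-in for "discretise these attributes, train Naive Bayes,
# measure accuracy": a base accuracy of 80 plus a made-up gain per attribute.
TOY_GAINS = {"the": 5, "and": 3, "of": -2, "but": -4}


def toy_score(subset: Subset) -> float:
    return 80 + sum(TOY_GAINS[a] for a in subset)


attributes = list(TOY_GAINS)
forward_history = forward_selection(attributes, toy_score)
backward_history = backward_selection(attributes, toy_score)
best_forward = best_step(forward_history)
best_backward = best_step(backward_history)
print(sorted(best_forward[0]), best_forward[1])
print(sorted(best_backward[0]), best_backward[1])
```

Each run records a score for every subset in the series, so the maximum can be read off afterwards. The toy scores are additive, which guarantees that both directions agree; with a real classifier the two directions may peak at different subsets.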
References
Chen, M.: A greedy algorithm with forward-looking strategy. In: Bednorz, W. (ed.) Greedy Algorithms, InTech (2008)
Dechter, A., Dechter, R.: On the greedy solution of ordering problems. ORSA J. Comput. 1(3), 181–189 (1989)
Bang-Jensen, J., Gutin, G., Yeo, A.: When the greedy algorithm fails. Discrete Optim. 1, 121–127 (2004)
Caruana, R., Freitag, D.: Greedy attribute selection. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 28–36. Morgan Kaufmann (1994)
Stańczyk, U.: Weighting of features by sequential selection. In: Stańczyk, U., Jain, L.C. (eds.) Feature Selection for Data and Pattern Recognition, pp. 71–90. Springer, Berlin, Heidelberg (2015)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
Baron, G.: Influence of data discretization on efficiency of Bayesian Classifier for authorship attribution. Procedia Comput. Sci. 35, 1112–1121 (2014)
Baron, G.: On sequential selection of attributes to be discretized for authorship attribution. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 229–234. July 2017
Juola, P.: Authorship attribution. Found. Trends Inf. Retr. 1(3), 233–334 (2006)
Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2017)
Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution. In: IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, pp. 69–72 (2003)
Zhao, Y., Zobel, J.: Searching with style: authorship attribution in classic literature. In: Proceedings of the Thirtieth Australasian Conference on Computer Science, Volume 62 (ACSC ’07), pp. 59–68. Australian Computer Society, Inc., Darlinghurst, Australia (2007)
Zhao, Y., Zobel, J.: Effective and scalable authorship attribution using function words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds.) Information Retrieval Technology, pp. 174–189. Springer, Berlin, Heidelberg (2005)
Baayen, H., van Halteren, H., Tweedie, F.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit. Linguist. Comput. 11(3), 121–132 (1996)
Dash, R., Paramguru, R.L., Dash, R.: Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Adv. Sci. Technol. 2(3), 29–37 (2011)
Yang, Y., Webb, G.I., Wu, X.: Discretization Methods, pp. 113–130. Springer, Boston, MA, US (2005)
Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th International Conference on Machine Learning, pp. 194–202. Morgan Kaufmann (1995)
García, S., Luengo, J., Sáez, J.A., López, V., Herrera, F.: A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013)
Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00), pp. 315–319. ACM, New York, NY, USA (2000)
Bakar, A.A., Othman, Z.A., Shuib, N.L.M.: Building a new taxonomy for data discretization techniques. In: 2009 2nd Conference on Data Mining and Optimization, pp. 132–140. Oct 2009
Peng, L., Qing, W., Yujia, G.: Study on comparison of discretization methods. In: 2009 International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, pp. 380–384. Nov 2009
Kotsiantis, S., Kanellopoulos, D.: Discretization techniques: a recent survey. Int. Trans. Comput. Sci. Eng. 1(32), 47–58 (2006)
Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: 13th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1022–1027. Morgan Kaufmann Publishers (1993)
Kononenko, I.: On biases in estimating multi-valued attributes. In: 14th International Joint Conference on Artificial Intelligence, pp. 1034–1040 (1995)
Baron, G.: Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Computer and Information Sciences: 31st International Symposium, ISCIS 2016, Kraków, Poland, October 27–28, 2016, Proceedings, pp. 81–89. Springer International Publishing, Cham (2016)
Baron, G., Harężlak, K.: On approaches to discretization of datasets used for evaluation of decision systems. In: Czarnowski, I., Caballero, M.A., Howlett, J.R., Jain, C.L., (eds.) Intelligent Decision Technologies 2016: Proceedings of the 8th KES International Conference on Intelligent Decision Technologies (KES-IDT 2016)—Part II, pp. 149–159. Springer International Publishing, Cham (2016)
Zhang, H.: The Optimality of Naive Bayes. In: Barr, V., Markov, Z. (eds.) FLAIRS Conference. AAAI Press (2004)
McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop On Learning For Text Categorization, pp. 41–48. AAAI Press (1998)
Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 29(2), 103–130 (1997)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
John, G., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann (1995)
Sardinha, B.: Using Key Words in Text Analysis: practical aspects. http://www2.lael.pucsp.br/direct/DirectPapers42.pdf (1999). Accessed 4 Jan 2018
Peng, R.D., Hengartner, N.W.: Quantitative analysis of literary styles. Am. Stat. 56(3), 175–185 (2002)
Argamon, S., Karlgren, J., Shanahan, J.G.: Stylistic analysis of text for information access. In: 28th Annual International ACM Conference on Research and Development in Information Retrieval. Brazil (2005)
Stańczyk, U.: Decision rule length as a basis for evaluation of attribute relevance. J. Intell. Fuzzy Syst. 24(3), 429–445 (2013)
Stańczyk, U.: The class imbalance problem in construction of training datasets for authorship attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4: 4th International Conference on Man-Machine Interactions, ICMMI 2015 Kocierz Pass, Poland, October 6–9, 2015, pp. 535–547. Springer International Publishing, Cham (2016)
Acknowledgements
The research described was performed using the WEKA workbench [6] at the Silesian University of Technology, Gliwice, Poland, within the framework of the project BK/RAu2/2018.
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Baron, G. (2019). Greedy Selection of Attributes to Be Discretised. In: Hassanien, A. (eds) Machine Learning Paradigms: Theory and Application. Studies in Computational Intelligence, vol 801. Springer, Cham. https://doi.org/10.1007/978-3-030-02357-7_3
DOI: https://doi.org/10.1007/978-3-030-02357-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02356-0
Online ISBN: 978-3-030-02357-7
eBook Packages: Intelligent Technologies and Robotics (R0)