Abstract
Ranking is a widely used strategy for estimating the relevance or importance of available characteristic features. Depending on the methodology applied, variables are assessed individually or as subsets, using statistics drawn from information theory, machine learning algorithms, or specialised procedures that search the feature space systematically. Information about the importance of attributes can be used in pre-processing, during initial data preparation, to remove irrelevant or superfluous elements. It can also be employed in post-processing, to optimise classifiers that have already been constructed. The chapter describes research on the latter approach, in which inferred decision rules are filtered by exploiting the ranking positions and scores of features. The optimised rule classifiers were applied in stylometric analysis of texts, for the task of binary authorship attribution.
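The abstract does not spell out the filtering procedure, so the following is only a minimal sketch of one plausible form of ranking-driven rule filtering: a rule is kept when every attribute in its premise is among the top-ranked features. The `Rule` class, the function `filter_rules_by_ranking`, the feature names, and the cut-off `top_k` are all hypothetical illustrations, not the chapter's actual method or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A decision rule: IF all condition attributes match THEN `decision`."""
    conditions: frozenset  # names of attributes used in the premise
    decision: str          # predicted class label
    support: int           # number of training samples matching the rule


def filter_rules_by_ranking(rules, ranking, top_k):
    """Keep only rules whose every condition attribute is in the top_k of the ranking."""
    allowed = set(ranking[:top_k])
    return [r for r in rules if r.conditions <= allowed]


# Hypothetical ranking of stylometric features, most important first.
ranking = ["and", "but", "comma", "not", "semicolon"]

# Hypothetical rule set for a binary authorship attribution task.
rules = [
    Rule(frozenset({"and", "but"}), "author_A", support=12),
    Rule(frozenset({"comma", "semicolon"}), "author_B", support=7),
    Rule(frozenset({"not"}), "author_A", support=5),
]

# With top_k=3 only the first rule survives: the other two refer to
# attributes ranked fourth or fifth.
kept = filter_rules_by_ranking(rules, ranking, top_k=3)
```

Raising `top_k` admits more rules, so in a sketch like this the cut-off trades classifier size against coverage. A score-based variant could instead weight each rule by the ranking scores of its attributes.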
Notes
1. The works are available for on-line reading and download in various e-book formats thanks to Project Gutenberg (see http://www.gutenberg.org).
References
Ahonen, H., Heinonen, O., Klemettinen, M., Verkamo, A.: Applying data mining techniques in text analysis. Technical report C-1997-23, Department of Computer Science, University of Helsinki, Finland (1997)
Argamon, S., Burns, K., Dubnov, S. (eds.): The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning. Springer, Berlin (2010)
Argamon, S., Karlgren, J., Shanahan, J.: Stylistic analysis of text for information access. In: Proceedings of the 28th International ACM Conference on Research and Development in Information Retrieval, Brazil (2005)
Baayen, H., van Halteren, H., Tweedie, F.: Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit. Linguist. Comput. 11(3), 121–132 (1996)
Baron, G.: Comparison of cross-validation and test sets approaches to evaluation of classifiers in authorship attribution domain. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Proceedings of the 31st International Symposium on Computer and Information Sciences, Communications in Computer and Information Science, vol. 659, pp. 81–89. Springer, Cracow (2016)
Bayardo Jr., R., Agrawal, R.: Mining the most interesting rules. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 145–154 (1999)
Biesiada, J., Duch, W., Kachel, A., Pałucha, S.: Feature ranking methods based on information entropy with Parzen windows. In: Proceedings of International Conference on Research in Electrotechnology and Applied Informatics, pp. 109–119, Katowice (2005)
Craig, H.: Stylistic analysis and authorship studies. In: Schreibman, S., Siemens, R., Unsworth, J. (eds.) A Companion to Digital Humanities. Blackwell, Oxford (2004)
Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1, 131–156 (1997)
Düntsch, I., Gediga, G.: Rough Set Data Analysis: A Road to Noninvasive Knowledge Discovery. Methoδos Publishers, Bangor (2000)
Fiesler, E., Beale, R.: Handbook of Neural Computation. Oxford University Press, Oxford (1997)
Greco, S., Matarazzo, B., Słowiński, R.: The use of rough sets and fuzzy sets in multi criteria decision making. In: Gal, T., Hanne, T., Stewart, T. (eds.) Advances in Multiple Criteria Decision Making, Chap. 14, pp. 14.1–14.59. Kluwer Academic Publishers, Dordrecht (1999)
Greco, S., Matarazzo, B., Słowiński, R.: Rough set theory for multicriteria decision analysis. Eur. J. Oper. Res. 129(1), 1–47 (2001)
Greco, S., Matarazzo, B., Słowiński, R.: Dominance-based rough set approach as a proper way of handling graduality in rough set theory. Trans. Rough Sets VII 4400, 36–52 (2007)
Greco, S., Słowiński, R., Stefanowski, J.: Evaluating importance of conditions in the set of discovered rules. Lect. Notes Artif. Intell. 4482, 314–321 (2007)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
Holte, R.: Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 11, 63–91 (1993)
Jensen, R., Shen, Q.: Computational Intelligence and Feature Selection. Wiley, Hoboken (2008)
Jockers, M., Witten, D.: A comparative study of machine learning methods for authorship attribution. Lit. Linguist. Comput. 25(2), 215–223 (2010)
John, G., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Cohen, W., Hirsh, H. (eds.) Machine Learning: Proceedings of the 11th International Conference, pp. 121–129. Morgan Kaufmann Publishers (1994)
Khmelev, D., Tweedie, F.: Using Markov chains for identification of writers. Lit. Linguist. Comput. 16(4), 299–307 (2001)
Koppel, M., Argamon, S., Shimoni, A.: Automatically categorizing written texts by author gender. Lit. Linguist. Comput. 17(4), 401–412 (2002)
Liu, H., Motoda, H.: Computational Methods of Feature Selection. Chapman & Hall/CRC, Boca Raton (2008)
Lynam, T., Clarke, C., Cormack, G.: Information extraction with term frequencies. In: Proceedings of the Human Language Technology Conference, pp. 1–4. San Diego (2001)
Moshkov, M., Piliszczuk, M., Zielosko, B.: On partial covers, reducts and decision rules with weights. Trans. Rough Sets VI 4374, 211–246 (2006)
Pawlak, Z.: Computing, artificial intelligence and information technology: rough sets, decision algorithms and Bayes’ theorem. Eur. J. Oper. Res. 136, 181–189 (2002)
Pawlak, Z.: Rough sets and intelligent data analysis. Inf. Sci. 147, 1–12 (2002)
Peng, R.: Statistical aspects of literary style. Bachelor’s thesis, Yale University (1999)
Peng, R., Hengartner, H.: Quantitative analysis of literary styles. Am. Stat. 56(3), 15–38 (2002)
Shen, Q.: Rough feature selection for intelligent classifiers. Trans. Rough Sets VII 4400, 244–255 (2006)
Sikora, M.: Rule quality measures in creation and reduction of data rule models. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H., Słowiński, R. (eds.) Rough Sets and Current Trends in Computing, Lecture Notes in Computer Science, vol. 4259, pp. 716–725. Springer, Berlin (2006)
Słowiński, R., Greco, S., Matarazzo, B.: Dominance-based rough set approach to reasoning about ordinal data. Lect. Notes Comput. Sci. (Lect. Notes Artif. Intell.) 4585, 5–11 (2007)
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
Stańczyk, U.: Weighting of attributes in an embedded rough approach. In: Gruca, A., Czachórski, T., Kozielski, S. (eds.) Man-Machine Interactions 3, Advances in Intelligent and Soft Computing, vol. 242, pp. 475–483. Springer, Berlin (2013)
Stańczyk, U.: Attribute ranking driven filtering of decision rules. In: Kryszkiewicz, M., Cornelis, C., Ciucci, D., Medina-Moreno, J., Motoda, H., Raś, Z. (eds.) Rough Sets and Intelligent Systems Paradigms. Lecture Notes in Computer Science, vol. 8537, pp. 217–224. Springer, Berlin (2014)
Stańczyk, U.: Feature evaluation by filter, wrapper and embedded approaches. In: Stańczyk, U., Jain, L. (eds.) Feature Selection for Data and Pattern Recognition. Studies in Computational Intelligence, vol. 584, pp. 29–44. Springer, Berlin (2015)
Stańczyk, U.: Selection of decision rules based on attribute ranking. J. Intell. Fuzzy Syst. 29(2), 899–915 (2015)
Stańczyk, U.: The class imbalance problem in construction of training datasets for authorship attribution. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4. Advances in Intelligent and Soft Computing, vol. 391, pp. 535–547. Springer, Berlin (2016)
Stańczyk, U.: Weighting and pruning of decision rules by attributes and attribute rankings. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds.) Proceedings of the 31st International Symposium on Computer and Information Sciences, Communications in Computer and Information Science, vol. 659, pp. 106–114. Springer, Cracow (2016)
Witten, I., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco (2011)
Wróbel, L., Sikora, M., Michalak, M.: Rule quality measures settings in classification, regression and survival rule induction – an empirical approach. Fundamenta Informaticae 149, 419–449 (2016)
Zielosko, B.: Application of dynamic programming approach to optimization of association rules relative to coverage and length. Fundamenta Informaticae 148(1–2), 87–105 (2016)
Zielosko, B.: Optimization of decision rules relative to coverage – comparison of greedy and modified dynamic programming approaches. In: Gruca, A., Brachman, A., Kozielski, S., Czachórski, T. (eds.) Man-Machine Interactions 4. Advances in Intelligent and Soft Computing, vol. 391, pp. 639–650. Springer, Berlin (2016)
Acknowledgements
The research used the WEKA workbench [40]. The 4eMka software used for DRSA processing [32] was developed at the Laboratory of Intelligent Decision Support Systems, Poznań, Poland. The research was performed at the Silesian University of Technology, Gliwice, within the project BK/RAu2/2017.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Stańczyk, U. (2018). Ranking-Based Rule Classifier Optimisation. In: Stańczyk, U., Zielosko, B., Jain, L. (eds) Advances in Feature Selection for Data and Pattern Recognition. Intelligent Systems Reference Library, vol 138. Springer, Cham. https://doi.org/10.1007/978-3-319-67588-6_7
DOI: https://doi.org/10.1007/978-3-319-67588-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67587-9
Online ISBN: 978-3-319-67588-6
eBook Packages: Engineering, Engineering (R0)