Abstract
Researchers and companies typically turn to relational data models when a single-table model cannot adequately describe their system. When it comes to classification, there are then two main options: apply a relational version of a classification algorithm, or use a propositionalization technique to transform the relational database into a single-table representation before classification. In this work, we evaluate a fast and simple propositionalization algorithm called Wordification. This technique combines the table name, attribute name, and value to create a feature. Each feature is treated as a word, and the instances of the database are represented by a Bag-of-Words (BOW) model. A weighting scheme is then used to weight the features of each instance. The original implementation of Wordification explored only the TF-IDF, term-frequency, and binary weighting schemes. However, work in the text classification and data mining fields shows that the proper choice of weighting scheme can boost classification performance. We therefore empirically evaluated different term weighting approaches in combination with Wordification. Our results show that the right combination of weighting scheme and classification algorithm can significantly improve classification performance.
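To make the idea concrete, the following is a minimal sketch of the Wordification pipeline described above: each relational instance is "wordified" into features of the form table__attribute__value, and the resulting bags of words are weighted with plain TF-IDF. The toy database, the function names, and the double-underscore separator are illustrative assumptions, not the original implementation.

```python
import math
from collections import Counter

def wordify(instance_tables):
    """Turn one relational instance (its rows across related tables)
    into a list of 'words' of the form table__attribute__value."""
    words = []
    for table, rows in instance_tables.items():
        for row in rows:
            for attr, value in row.items():
                words.append(f"{table}__{attr}__{value}")
    return words

def tfidf(corpus):
    """Weight each instance's bag of words with unsmoothed TF-IDF:
    tf(w, d) * log(N / df(w))."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency counts each doc once
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weighted.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weighted

# Hypothetical toy database: each instance is a customer with one row
# in a 'customer' table and related rows in an 'order' table.
instances = [
    {"customer": [{"region": "EU"}],
     "order": [{"item": "book"}, {"item": "pen"}]},
    {"customer": [{"region": "US"}],
     "order": [{"item": "book"}]},
]
docs = [wordify(inst) for inst in instances]
weights = tfidf(docs)
```

Words appearing in every instance (here, order__item__book) receive zero TF-IDF weight, which is exactly why the choice of weighting scheme matters: a binary or term-frequency scheme would keep such features alive for the classifier.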
Notes
1. Datasets available at http://kt.ijs.si/janez_kranjc/ilp_datasets/.
2. Datasets available at https://relational.fit.cvut.cz/.
Acknowledgment
We would like to thank the Brazilian research agencies CAPES and CNPq for their financial support, Janez Kranjc for clarifications regarding ClowdFlows, and all the authors of Wordification for making it available.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Sciammarella, T., Zaverucha, G. (2020). Weight Your Words: The Effect of Different Weighting Schemes on Wordification Performance. In: Kazakov, D., Erten, C. (eds) Inductive Logic Programming. ILP 2019. Lecture Notes in Computer Science(), vol 11770. Springer, Cham. https://doi.org/10.1007/978-3-030-49210-6_10
Print ISBN: 978-3-030-49209-0
Online ISBN: 978-3-030-49210-6