
Weight Your Words: The Effect of Different Weighting Schemes on Wordification Performance

  • Conference paper
  • First Online:
Inductive Logic Programming (ILP 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11770)


Abstract

Relational data models are commonly used by researchers and companies when a single-table model cannot adequately describe their system. When it comes to classification, there are then mainly two options: apply a relational version of a classification algorithm, or use a propositionalization technique to transform the relational database into a single-table representation before classification. In this work, we evaluate a fast and simple propositionalization algorithm called Wordification. This technique combines the table name, attribute name, and attribute value to create each feature. Each feature is treated as a word, and the instances of the database are represented by a Bag-Of-Words (BOW) model. A weighting scheme is then used to weight the features of each instance. The original implementation of Wordification explored only the TF-IDF, term-frequency, and binary weighting schemes. However, work in the text-classification and data-mining fields shows that a proper choice of weighting scheme can boost classification performance. We therefore empirically evaluated different term-weighting approaches with Wordification. Our results show that the right combination of weighting scheme and classification algorithm can significantly improve classification performance.
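The transformation described above can be sketched in a few lines. The toy data below is hypothetical (loosely inspired by train-style relational benchmarks), and the double-underscore word format is an illustrative convention, not the paper's exact implementation; the TF-IDF formula shown is the standard one and may differ in detail from the scheme Wordification uses.

```python
import math
from collections import Counter

# Hypothetical relational data: each main-table instance (a train) is linked
# to rows of a secondary "car" table, given here as (table, attribute, value).
trains = {
    "t1": [("car", "shape", "box"), ("car", "shape", "box"), ("car", "roof", "flat")],
    "t2": [("car", "shape", "oval"), ("car", "roof", "flat")],
}

# Wordification step: one "word" per (table, attribute, value) triple,
# turning each relational instance into a bag of words.
docs = {tid: [f"{t}__{a}__{v}" for (t, a, v) in rows]
        for tid, rows in trains.items()}

def tfidf(docs):
    """Weight each word of each document with standard TF-IDF."""
    n = len(docs)
    # Document frequency: in how many instances each word appears.
    df = Counter(w for words in docs.values() for w in set(words))
    vectors = {}
    for tid, words in docs.items():
        tf = Counter(words)
        vectors[tid] = {w: (c / len(words)) * math.log(n / df[w])
                        for w, c in tf.items()}
    return vectors

vecs = tfidf(docs)
# "car__roof__flat" occurs in every instance, so its IDF (and weight) is 0;
# "car__shape__box" is specific to t1 and receives a positive weight.
```

Swapping in another weighting scheme only requires replacing the per-word formula inside `tfidf`, which is what makes this representation convenient for the comparison the paper carries out.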


Notes

  1. Datasets available at http://kt.ijs.si/janez_kranjc/ilp_datasets/.

  2. Datasets available at https://relational.fit.cvut.cz/.


Acknowledgment

We would like to thank the Brazilian research agencies CAPES and CNPq for their financial support, Janez Kranjc for clarifications regarding ClowdFlows, and all the authors of Wordification for making it available.

Author information


Correspondence to Tatiana Sciammarella or Gerson Zaverucha.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Sciammarella, T., Zaverucha, G. (2020). Weight Your Words: The Effect of Different Weighting Schemes on Wordification Performance. In: Kazakov, D., Erten, C. (eds) Inductive Logic Programming. ILP 2019. Lecture Notes in Computer Science (LNAI), vol 11770. Springer, Cham. https://doi.org/10.1007/978-3-030-49210-6_10


  • DOI: https://doi.org/10.1007/978-3-030-49210-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-49209-0

  • Online ISBN: 978-3-030-49210-6

  • eBook Packages: Computer Science, Computer Science (R0)
