
Weight Your Words: The Effect of Different Weighting Schemes on Wordification Performance

  • Conference paper
  • First Online:
Inductive Logic Programming (ILP 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11770)


Abstract

Relational data models are commonly used by researchers and companies when a single-table model cannot adequately describe their system. When it comes to classification, there are then mainly two options: apply a relational version of a classification algorithm, or use a propositionalization technique to transform the relational database into a single-table representation before classification. In this work, we evaluate a fast and simple propositionalization algorithm called Wordification. This technique combines the table name, attribute name, and attribute value to create each feature. Each feature is treated as a word, and the instances of the database are represented by a Bag-Of-Words (BOW) model. A weighting scheme is then used to weight the features of each instance. The original implementation of Wordification explored only the TF-IDF, term-frequency, and binary weighting schemes. However, work in the text-classification and data-mining fields shows that a proper choice of weighting scheme can boost classification performance. We therefore empirically evaluated different term-weighting approaches with Wordification. Our results show that the right combination of weighting scheme and classification algorithm can significantly improve classification performance.
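The transformation described above can be sketched in a few lines. The toy data below is hypothetical (loosely inspired by train-style relational benchmarks), and the double-underscore word format is an illustrative convention, not the paper's exact implementation; the TF-IDF formula shown is the standard one and may differ in detail from the scheme Wordification uses.

```python
import math
from collections import Counter

# Hypothetical relational data: each main-table instance (a train) is linked
# to rows of a secondary "car" table, given here as (table, attribute, value).
trains = {
    "t1": [("car", "shape", "box"), ("car", "shape", "box"), ("car", "roof", "flat")],
    "t2": [("car", "shape", "oval"), ("car", "roof", "flat")],
}

# Wordification step: one "word" per (table, attribute, value) triple,
# turning each relational instance into a bag of words.
docs = {tid: [f"{t}__{a}__{v}" for (t, a, v) in rows]
        for tid, rows in trains.items()}

def tfidf(docs):
    """Weight each word of each document with standard TF-IDF."""
    n = len(docs)
    # Document frequency: in how many instances each word appears.
    df = Counter(w for words in docs.values() for w in set(words))
    vectors = {}
    for tid, words in docs.items():
        tf = Counter(words)
        vectors[tid] = {w: (c / len(words)) * math.log(n / df[w])
                        for w, c in tf.items()}
    return vectors

vecs = tfidf(docs)
# "car__roof__flat" occurs in every instance, so its IDF (and weight) is 0;
# "car__shape__box" is specific to t1 and receives a positive weight.
```

Swapping in another weighting scheme only requires replacing the per-word formula inside `tfidf`, which is what makes this representation convenient for the comparison the paper carries out.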


Notes

  1. Datasets available at http://kt.ijs.si/janez_kranjc/ilp_datasets/.

  2. Datasets available at https://relational.fit.cvut.cz/.


Acknowledgment

We would like to thank the Brazilian research agencies CAPES and CNPq for their financial support, Janez Kranjc for clarifications regarding ClowdFlows, and all the authors of Wordification for making it available.

Author information


Correspondence to Tatiana Sciammarella or Gerson Zaverucha.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Sciammarella, T., Zaverucha, G. (2020). Weight Your Words: The Effect of Different Weighting Schemes on Wordification Performance. In: Kazakov, D., Erten, C. (eds) Inductive Logic Programming. ILP 2019. Lecture Notes in Computer Science (LNAI), vol 11770. Springer, Cham. https://doi.org/10.1007/978-3-030-49210-6_10


  • DOI: https://doi.org/10.1007/978-3-030-49210-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-49209-0

  • Online ISBN: 978-3-030-49210-6

  • eBook Packages: Computer Science, Computer Science (R0)
