Generating Term Weighting Schemes Through Genetic Programming

  • Conference paper
Machine Learning, Optimization, and Data Science (LOD 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11331)

Abstract

A term-weighting scheme (TWS) is an important component of text classification: it determines how documents are represented in the Vector Space Model (VSM). Although state-of-the-art TWSs perform well, many recent works propose new approaches and new TWSs that improve performance further. Moreover, it remains difficult to tell which TWS is best suited to a specific problem. In this paper, we automatically generate new TWSs with evolutionary algorithms, specifically genetic programming (GP). GP evolves and combines different statistical information and generates a new TWS based on the performance of the learning method. We evaluate the generated TWSs on three well-known benchmarks. Our study shows that even early-generated formulas are quite competitive with state-of-the-art TWSs and in some cases outperform them.
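
To make the idea of a TWS as an evolvable formula concrete, here is a minimal Python sketch. It is not the authors' implementation: the primitive set (`add`, `mul`, `div`, `log`), the statistics `tf`, `df` and `N`, and the helper names `evaluate` and `weight_document` are all assumptions chosen for illustration. It shows only how a candidate formula can be stored as an expression tree and applied to a document, using the classic tf-idf formula as one such tree. A GP run would additionally generate, mutate and recombine trees and score each one by the performance of the learning method, which is omitted here.

```python
import math
import operator


# Hypothetical "protected" primitives, a common GP convention, so that
# any randomly built tree can be evaluated without raising an error.
def protected_div(a, b):
    return a / b if b != 0 else 1.0


def protected_log(a):
    return math.log(abs(a)) if a != 0 else 0.0


# Assumed function set: name -> (callable, arity).
FUNCTIONS = {
    "add": (operator.add, 2),
    "mul": (operator.mul, 2),
    "div": (protected_div, 2),
    "log": (protected_log, 1),
}


def evaluate(tree, stats):
    """Evaluate an expression tree on a dict of term statistics.

    A tree is a statistic name (str), a numeric constant, or a tuple
    (function_name, child, ...).
    """
    if isinstance(tree, str):
        return float(stats[tree])
    if isinstance(tree, (int, float)):
        return float(tree)
    name, *children = tree
    func, arity = FUNCTIONS[name]
    if len(children) != arity:
        raise ValueError(f"{name} expects {arity} argument(s)")
    return func(*(evaluate(child, stats) for child in children))


# tf-idf expressed as a tree: tf * log(N / df).
TF_IDF = ("mul", "tf", ("log", ("div", "N", "df")))


def weight_document(tree, tokens, doc_freq, n_docs):
    """Map each term of one document to the weight given by `tree`."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return {
        term: evaluate(tree, {"tf": tf, "df": doc_freq[term], "N": n_docs})
        for term, tf in counts.items()
    }


# Toy corpus (invented for this example).
docs = [["gp", "tws"], ["gp", "svm"], ["gp", "gp", "text"]]
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

weights = weight_document(TF_IDF, docs[2], doc_freq, len(docs))
print(weights)
```

In this toy corpus "gp" occurs in every document, so the tf-idf tree assigns it a weight of zero, while the rarer "text" receives log(3). Swapping `TF_IDF` for another tree changes the document representation without touching the rest of the pipeline, which is what makes the formula itself a natural object for GP to search over.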

Notes

  1. http://eigen.tuxfamily.org/

  2. https://bitbucket.org/mazyad/eigennlp

  3. http://disi.unitn.it/moschitti/corpora.htm

  4. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

Author information

Corresponding author

Correspondence to Fabien Teytaud.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Mazyad, A., Teytaud, F., Fonlupt, C. (2019). Generating Term Weighting Schemes Through Genetic Programming. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2018. Lecture Notes in Computer Science, vol 11331. Springer, Cham. https://doi.org/10.1007/978-3-030-13709-0_8

  • DOI: https://doi.org/10.1007/978-3-030-13709-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-13708-3

  • Online ISBN: 978-3-030-13709-0

  • eBook Packages: Computer Science, Computer Science (R0)
