Beyond the Bag of Words: A Text Representation for Sentence Selection

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4013)

Abstract

Sentence selection shares some, but not all, of the characteristics of automatic text categorization, so some, but not all, of the same techniques apply. In this paper we study a syntactically and semantically enriched text representation for the sentence selection task on a genomics corpus. We show that using technical dictionaries and syntactic relations is beneficial for this problem when using state-of-the-art machine learning algorithms. Furthermore, a first-order rule learner can exploit the syntactic relations to obtain even better performance.
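
To make the idea of an enriched representation concrete, here is a minimal sketch of how a bag of words might be extended with dictionary-term and syntactic-relation features. It is an illustration only, not the authors' implementation: the function names, the `DICT:` and `REL:` feature prefixes, the toy sentence, and the relation triples (assumed to come from some external parser) are all hypothetical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Plain bag of words: count each lower-cased token."""
    return Counter(t.lower() for t in tokens)


def enriched_features(tokens, relations, dictionary):
    """Bag of words plus dictionary-term and syntactic-relation features.

    relations:  iterable of (head, label, dependent) triples, assumed to be
                supplied by an external parser (hypothetical input).
    dictionary: set of lower-cased technical terms (hypothetical input).
    """
    feats = bag_of_words(tokens)
    # Flag tokens that appear in the technical dictionary.
    for t in tokens:
        if t.lower() in dictionary:
            feats["DICT:" + t.lower()] += 1
    # Add one feature per syntactic relation between two words.
    for head, label, dep in relations:
        feats[f"REL:{label}({head.lower()},{dep.lower()})"] += 1
    return feats


# Toy input, invented for illustration only.
tokens = ["ProteinA", "activates", "GeneB"]
relations = [("activates", "subj", "ProteinA"), ("activates", "obj", "GeneB")]
dictionary = {"proteina", "geneb"}

features = enriched_features(tokens, relations, dictionary)
```

The resulting counter can be passed to any learner that accepts sparse feature dictionaries. Keeping the relation triples as structured data, rather than flattening them into strings, is what would let a first-order rule learner use them directly.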

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Caropreso, M.F., Matwin, S. (2006). Beyond the Bag of Words: A Text Representation for Sentence Selection. In: Lamontagne, L., Marchand, M. (eds) Advances in Artificial Intelligence. Canadian AI 2006. Lecture Notes in Computer Science, vol 4013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766247_28

  • DOI: https://doi.org/10.1007/11766247_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34628-9

  • Online ISBN: 978-3-540-34630-2

  • eBook Packages: Computer Science, Computer Science (R0)
