Beyond the Bag of Words: A Text Representation for Sentence Selection

Caropreso, Maria Fernanda; Matwin, Stan

doi:10.1007/11766247_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4013)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence

Abstract

Sentence selection shares some, but not all, of the characteristics of automatic text categorization. Therefore some, but not all, of the same techniques should be used. In this paper we study a syntactically and semantically enriched text representation for the sentence selection task in a genomics corpus. We show that using technical dictionaries and syntactic relations is beneficial for our problem when using state-of-the-art machine learning algorithms. Furthermore, the syntactic relations can be used by a first-order rule learner to obtain even better performance.
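As a rough illustration of what an enriched representation of this kind might look like, the sketch below extends a plain bag of words with dictionary categories and syntactic-relation features. It is not the authors' implementation: the function name `enriched_features`, the feature prefixes, the toy dictionary, and the placeholder sentence are all assumptions made for the example, and the relation triples are assumed to come from some external parser.

```python
from collections import Counter

# Hypothetical technical dictionary: lowercased term -> semantic category.
# The entries are placeholders, not real gene or protein names.
TECH_DICT = {"proteinx": "PROTEIN", "geney": "GENE"}


def enriched_features(tokens, relations, dictionary=TECH_DICT):
    """Build a feature bag for one sentence.

    tokens    -- list of word strings
    relations -- list of (relation, head, dependent) triples, assumed to be
                 produced by an external syntactic parser
    """
    feats = Counter()
    for tok in tokens:
        word = tok.lower()
        feats["w=" + word] += 1                # plain bag-of-words feature
        category = dictionary.get(word)
        if category is not None:
            feats["cat=" + category] += 1      # dictionary (semantic) feature
    for rel, head, dep in relations:
        # One feature per syntactic relation between two words.
        feats["rel=%s(%s,%s)" % (rel, head.lower(), dep.lower())] += 1
    return feats


sentence = ["proteinX", "activates", "geneY"]
relations = [("subj", "activates", "proteinX"), ("obj", "activates", "geneY")]
features = enriched_features(sentence, relations)
```

A feature bag like this could be passed to a propositional learner as it stands, whereas a first-order rule learner would more naturally take the relation triples themselves as facts.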

Author information

Authors and Affiliations

School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, K1N 6N5
Maria Fernanda Caropreso & Stan Matwin
Institute for Computer Science, Polish Academy of Sciences, Warsaw
Stan Matwin

Editor information

Editors and Affiliations

Department of Computer Science and Software Engineering, Laval University, G1K 7P4, Québec, Canada
Luc Lamontagne
Département IFT-GLO, Pavillon Adrien-Pouliot, Université Laval, G1K-7P4, Québec, Canada
Mario Marchand

About this paper

Cite this paper

Caropreso, M.F., Matwin, S. (2006). Beyond the Bag of Words: A Text Representation for Sentence Selection. In: Lamontagne, L., Marchand, M. (eds) Advances in Artificial Intelligence. Canadian AI 2006. Lecture Notes in Computer Science, vol 4013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766247_28

DOI: https://doi.org/10.1007/11766247_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34628-9
Online ISBN: 978-3-540-34630-2
eBook Packages: Computer Science, Computer Science (R0)
