Abstract
Sentence selection shares some, but not all, of the characteristics of automatic text categorization. Therefore some, but not all, of the same techniques should be used. In this paper we study a syntactically and semantically enriched text representation for the sentence selection task in a genomics corpus. We show that using technical dictionaries and syntactic relations is beneficial for our problem when using state-of-the-art machine learning algorithms. Furthermore, the syntactic relations can be used by a first-order rule learner to obtain even better performance.
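As a rough illustration of what an enriched representation of this kind might look like, the sketch below adds dictionary-class features and syntactic-relation features to a plain bag of words. It is not the authors' implementation: the dictionary entries, the class name `GENE`, the relation labels `subj` and `obj`, the feature prefixes, and the example sentence are all hypothetical, and the relation triples are assumed to come from an external parser.

```python
from collections import Counter

# Hypothetical technical dictionary: lowercase term -> semantic class.
# Entries and class names are invented for illustration only.
TECH_DICT = {"genea": "GENE", "geneb": "GENE"}


def enrich(tokens, relations):
    """Build a feature bag from words, dictionary classes, and syntactic pairs.

    tokens    -- list of word strings for one sentence
    relations -- list of (label, head, dependent) triples, assumed to be
                 produced by an external syntactic parser
    """
    features = Counter()
    for tok in tokens:
        key = tok.lower()
        features["w=" + key] += 1                  # plain bag-of-words feature
        if key in TECH_DICT:
            features["c=" + TECH_DICT[key]] += 1   # semantic class feature
    for label, head, dep in relations:
        # Generalize each argument to its dictionary class when one exists.
        h = TECH_DICT.get(head.lower(), head.lower())
        d = TECH_DICT.get(dep.lower(), dep.lower())
        features["r=%s(%s,%s)" % (label, h, d)] += 1
    return features


# Invented example sentence with hand-written relation triples.
tokens = ["geneA", "activates", "geneB"]
relations = [("subj", "activates", "geneA"), ("obj", "activates", "geneB")]
for name, count in sorted(enrich(tokens, relations).items()):
    print(name, count)
```

The resulting feature bag can be passed to any propositional learner, whereas a first-order rule learner would presumably consume the relation triples directly as predicates instead of flattened strings.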
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Caropreso, M.F., Matwin, S. (2006). Beyond the Bag of Words: A Text Representation for Sentence Selection. In: Lamontagne, L., Marchand, M. (eds) Advances in Artificial Intelligence. Canadian AI 2006. Lecture Notes in Computer Science, vol 4013. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11766247_28
DOI: https://doi.org/10.1007/11766247_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34628-9
Online ISBN: 978-3-540-34630-2
eBook Packages: Computer Science, Computer Science (R0)