Abstract
Data-driven approaches to sentence compression define the task as dropping any subset of words from the input sentence while retaining important information and grammaticality. We show that only 16% of the observed compressed sentences in the domain of subtitling can be accounted for in this way. We argue that this is partly due to the lack of appropriate evaluation material and estimate that a deletion model is in fact compatible with approximately 55% of the observed data. We analyse the remaining cases, in which deletion alone failed to provide the required level of compression. We conclude that in those cases word order changes and paraphrasing are crucial. We therefore argue for more elaborate sentence compression models which include paraphrasing and word reordering. We report preliminary results of applying a recently proposed, more powerful compression model in the context of subtitling for Dutch.
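To make the deletion-only definition concrete, the following minimal sketch checks whether a compressed sentence could have been produced purely by dropping words from its source, that is, whether its tokens form an in-order subsequence of the source tokens. The function name, the whitespace tokenisation, and the English example sentences are illustrative assumptions; this is not the procedure or the data used in the chapter.

```python
def is_deletion_compression(source_tokens, compressed_tokens):
    """Return True if compressed_tokens can be obtained from source_tokens
    by deleting words only (no reordering, no substitution).

    Hypothetical helper for illustration; tokens are compared by exact
    string equality, which is an assumption.
    """
    remaining = iter(source_tokens)
    # `tok in remaining` advances the iterator until it finds tok, so each
    # compressed token must occur after the previously matched one.
    return all(tok in remaining for tok in compressed_tokens)


# Invented example sentences, tokenised on whitespace.
source = "the minister said yesterday that the plan was cancelled".split()

by_deletion = "the minister said the plan was cancelled".split()
reordered = "the plan was cancelled the minister said".split()
paraphrased = "the minister scrapped the plan".split()

print(is_deletion_compression(source, by_deletion))
print(is_deletion_compression(source, reordered))
print(is_deletion_compression(source, paraphrased))
```

Under this check, only the first candidate counts as a deletion-based compression; the reordered and paraphrased candidates fall outside what a pure deletion model can generate.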
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Marsi, E., Krahmer, E., Hendrickx, I., Daelemans, W. (2010). On the Limits of Sentence Compression by Deletion. In: Krahmer, E., Theune, M. (eds) Empirical Methods in Natural Language Generation. EACL ENLG 2009. Lecture Notes in Computer Science, vol 5790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15573-4_3
DOI: https://doi.org/10.1007/978-3-642-15573-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15572-7
Online ISBN: 978-3-642-15573-4
eBook Packages: Computer Science, Computer Science (R0)