Abstract
We present an attempt to exploit the large number of summaries contained in the New York Times Annotated Corpus (NYTAC). We introduce five methods, inspired by domain adaptation techniques from other research areas, to train our supervised summarization system, and evaluate them on three test sets: DUC2002, RSTDTB\(_{\text {long}}\) and RSTDTB\(_{\text {short}}\). Among the five methods, the one trained on the NYTAC and then fine-tuned on the target data performs best on all three test sets. We also propose an instance selection method based on the faithfulness of the extractive oracle summary to the reference summary and empirically show that it improves summarization performance.
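The instance selection idea in the abstract can be illustrated with a short sketch. Here the faithfulness score is a simplified unigram-recall stand-in for ROUGE, and `thr`, `select_instances`, and the toy data are all illustrative, not the authors' implementation:

```python
from collections import Counter

def unigram_recall(candidate, reference):
    """Fraction of reference unigrams covered by the candidate
    (a simplified stand-in for a ROUGE-style faithfulness score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / max(sum(ref.values()), 1)

def select_instances(pairs, thr=0.1):
    """Keep training documents whose extractive oracle summary is
    sufficiently faithful to the human-written reference summary."""
    return [(doc, oracle, ref) for doc, oracle, ref in pairs
            if unigram_recall(oracle, ref) >= thr]

pairs = [
    ("doc A ...", "the cat sat on the mat", "the cat sat on the mat today"),
    ("doc B ...", "stocks fell sharply", "the committee approved a new budget"),
]
kept = select_instances(pairs, thr=0.5)
```

With this threshold, only the first document survives: its oracle covers most of its reference, while the second pair shares no unigrams at all.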
Notes
- 1.
The current datasets for multi-document summarization are also small.
- 2.
In this paper, the (extractive) oracle summary is defined to be the best possible summary that can be generated by sentence extraction, and the reference summary is defined to be the original human-written summary in NYTAC.
- 3.
- 4.
© 2016 The New York Times Annotated Corpus, used with permission.
- 5.
When the benefit of each sentence is represented as the dot product of a weight vector and a feature vector, the benefit can be negative. An optimization problem with negative benefits cannot, strictly speaking, be regarded as a knapsack problem (KP). However, such cases are very rare and can be ignored in practice.
- 6.
Some reference summaries contain more than 100 words. Such summaries as well as system summaries were truncated to 100 words during the evaluation.
- 7.
With options “-a -x -n 1 -m -s” on version 1.5.5 of the official ROUGE script.
- 8.
With options “-a -x -n 2 -m” on version 1.5.5 of the official ROUGE script.
- 9.
For the statistical significance test, we used the Wilcoxon signed-rank test (\(p\le 0.05\)).
- 10.
The selected values of thr for each fold were 0.1, 0.1, 0.1, 0.1 and 0.1, respectively.
- 11.
An explanation of these two types of summaries can be found in the book by Nenkova and McKeown [23]. We quote the relevant part: “A summary that enables the reader to determine about-ness has often been called an indicative summary, while one that can be read in place of the document has been called an informative summary.”
- 12.
The selected values of thr for each fold are 0.3, 0.6, 0.3, 0.3 and 0.6, respectively.
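Note 2 defines the extractive oracle as the best possible summary obtainable by sentence extraction. One common way to approximate such an oracle (a greedy sketch with a simplified unigram-overlap objective, not necessarily the procedure used in the paper) is to repeatedly add the sentence that most improves overlap with the reference summary:

```python
from collections import Counter

def overlap(sentences, reference):
    """Unigram overlap between extracted sentences and the reference summary."""
    cand = Counter(" ".join(sentences).lower().split())
    ref = Counter(reference.lower().split())
    return sum(min(n, ref[w]) for w, n in cand.items())

def greedy_oracle(sentences, reference, max_sents=3):
    """Greedily add the sentence that most improves overlap with the reference."""
    chosen, remaining = [], list(sentences)
    while remaining and len(chosen) < max_sents:
        best = max(remaining, key=lambda s: overlap(chosen + [s], reference))
        if overlap(chosen + [best], reference) <= overlap(chosen, reference):
            break  # no remaining sentence improves the oracle
        chosen.append(best)
        remaining.remove(best)
    return chosen

document = ["the economy grew fast", "the weather was sunny", "growth beat forecasts"]
reference = "the economy grew and beat forecasts"
oracle = greedy_oracle(document, reference)
```

The greedy loop stops as soon as no sentence adds overlap, so off-topic sentences (here the weather one) are never selected.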
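Note 5 refers to casting sentence extraction as a 0-1 knapsack problem: maximize the summed sentence benefits under a summary-length budget. A minimal dynamic-programming sketch follows, with illustrative benefits and costs rather than learned weights; as the note observes, sentences whose dot-product benefit comes out negative can simply be skipped, since they can never improve the objective:

```python
def knapsack_summary(benefits, costs, budget):
    """0-1 knapsack over sentences: maximize total benefit subject to
    a length budget. Returns (best_benefit, chosen_indices)."""
    # dp[c] = (benefit, chosen) best achievable with total cost <= c
    dp = [(0.0, [])] * (budget + 1)
    for i in range(len(benefits)):
        if benefits[i] <= 0:
            continue  # negative-benefit sentences can never help; skip them
        new_dp = dp[:]
        for c in range(costs[i], budget + 1):
            cand = (dp[c - costs[i]][0] + benefits[i],
                    dp[c - costs[i]][1] + [i])
            if cand[0] > new_dp[c][0]:
                new_dp[c] = cand
        dp = new_dp
    return dp[budget]

# benefit = w . f(sentence); here illustrative scores and word counts
benefits = [3.0, -0.5, 2.0, 1.5]
costs = [10, 5, 8, 7]
best, chosen = knapsack_summary(benefits, costs, budget=18)
```

With a budget of 18 words, the optimum picks the first and third sentences (benefit 5.0); the negative-benefit sentence is ignored outright.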
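The Wilcoxon signed-rank test mentioned in note 9 can be sketched in a few lines. The per-document scores below are made up for illustration; for n = 8 pairs the standard two-sided critical value at p ≤ 0.05 is 3, so the difference is significant when min(W+, W−) ≤ 3:

```python
def wilcoxon_statistic(a, b):
    """min(W+, W-) for paired samples: drop zero differences,
    rank absolute differences (average ranks for ties), sum signed ranks."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Illustrative paired per-document ROUGE scores for two systems
system_a = [0.42, 0.38, 0.51, 0.45, 0.40, 0.47, 0.39, 0.44]
system_b = [0.40, 0.35, 0.49, 0.44, 0.38, 0.45, 0.37, 0.41]
w = wilcoxon_statistic(system_a, system_b)
significant = w <= 3  # critical value for n = 8, two-sided p <= 0.05
```

Since system A beats system B on every document here, W− is zero and the difference is trivially significant.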
References
Almeida, M., Martins, A.: Fast and robust compressive summarization with dual decomposition and multi-task learning. In: Proceedings of ACL 2013, pp. 196–206 (2013)
Axelrod, A., He, X., Gao, J.: Domain adaptation via pseudo in-domain data selection. In: Proceedings of EMNLP 2011, pp. 355–362 (2011)
Biçici, E.: Domain adaptation for machine translation with instance selection. Prague Bull. Math. Linguist. 103, 5–20 (2015)
Carlson, L., Marcu, D., Okurowski, M.E.: RST discourse treebank. In: Linguistic Data Consortium (2002). https://catalog.ldc.upenn.edu/LDC2002T07
Linguistic Data Consortium: Hansard Corpus of Parallel English and French. Linguistic Data Consortium (1997). http://www.ldc.upenn.edu/
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
Crammer, K., McDonald, R., Pereira, F.: Scalable large-margin online learning for structured classification. In: Proceedings of NIPS05 Workshop on Learning With Structured Outputs (2005)
Daumé III, H.: Frustratingly easy domain adaptation. In: Proceedings of ACL 2007, pp. 256–263 (2007)
Daumé, H., Marcu, D.: Induction of word and phrase alignments for automatic document summarization. Comput. Linguist. 31(4), 505–530 (2005)
DUC: Document understanding conference. In: ACL Workshop on Automatic Summarization (2002)
Hirao, T., Isozaki, H., Maeda, E., Matsumoto, Y.: Extracting important sentences with support vector machines. In: Proceedings of COLING 2002, vol. 1, pp. 1–7 (2002)
Hirao, T., Yoshida, Y., Nishino, M., Yasuda, N., Nagata, M.: Single-document summarization as a tree knapsack problem. In: Proceedings of EMNLP 2013, pp. 1515–1520 (2013)
Hong, K., Nenkova, A.: Improving the estimation of word importance for news multi-document summarization. In: Proceedings of EACL 2014, pp. 712–721 (2014)
Jing, H., McKeown, K.R.: Cut and paste based text summarization. In: Proceedings of NAACL 2000, pp. 178–185 (2000)
Li, C., Liu, Y., Zhao, L.: Using external resources and joint learning for bigram weighting in ILP-based multi-document summarization. In: Proceedings of NAACL 2015, pp. 778–787 (2015)
Li, C., Qian, X., Liu, Y.: Using supervised bigram-based ILP for extractive summarization. In: Proceedings of ACL 2013, pp. 1004–1013 (2013)
Li, J.J., Nenkova, A.: Fast and accurate prediction of sentence specificity. In: Proceedings of AAAI 2015, pp. 2281–2287 (2015)
Li, Q.: Literature survey: domain adaptation algorithms for natural language processing. Technical report, Department of Computer Science. The Graduate Center, The City University of New York (2012)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, pp. 74–81 (2004)
Marcu, D.: Improving summarization through rhetorical parsing tuning. In: Proceedings of Sixth Workshop on Very Large Corpora, pp. 206–215 (1998)
Marcu, D.: The automatic construction of large-scale corpora for summarization research. In: Proceedings of SIGIR99, pp. 137–144 (1999)
Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of EMNLP 2004, pp. 404–411 (2004)
Nenkova, A., McKeown, K.: Automatic summarization. Found. Trends Inf. Retr. 5(2–3), 103–233 (2011)
Nishikawa, H., Arita, K., Tanaka, K., Hirao, T., Makino, T., Matsuo, Y.: Learning to generate coherent summary with discriminative hidden semi-Markov model. In: Proceedings of COLING 2014, pp. 1648–1659 (2014)
Remus, R.: Domain adaptation using domain similarity- and domain complexity-based instance selection for cross-domain sentiment analysis. In: Proceedings of the ICDM 2012 Workshop on SENTIRE, pp. 717–723 (2012)
Sandhaus, E.: The New York Times annotated corpus. In: Linguistic Data Consortium (2008). https://catalog.ldc.upenn.edu/LDC2008T19
Sipos, R., Shivaswamy, P., Joachims, T.: Large-margin learning of submodular summarization models. In: Proceedings of EACL 2012, pp. 224–233 (2012)
Svore, K., Vanderwende, L., Burges, C.: Enhancing single-document summarization by combining RankNet and third-party sources. In: Proceedings of EMNLP-CoNLL 2007, pp. 448–457 (2007). http://www.aclweb.org/anthology/D/D07/D07-1047
Takamura, H., Okumura, M.: Learning to generate summary as structured output. In: Proceedings of CIKM 2010, pp. 1437–1440 (2010)
Xia, R., Zong, C., Hu, X., Cambria, E.: Feature ensemble plus sample selection: domain adaptation for sentiment classification. IEEE Intell. Syst. 28(3), 10–18 (2013)
Yang, Y., Nenkova, A.: Detecting information-dense texts in multiple news domains. In: Proceedings of AAAI 2014, pp. 1650–1656 (2014)
Yih, W.T., Goodman, J., Vanderwende, L., Suzuki, H.: Multi-document summarization by maximizing informative content-words. In: Proceedings of IJCAI 2007, pp. 1776–1782 (2007)
Zhao, J., Qiu, X., Liu, Z., Huang, X.: Online distributed passive-aggressive algorithm for structured learning. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds.) CCL and NLP-NABD 2013. LNCS, vol. 8202, pp. 120–130. Springer, Heidelberg (2013)
Acknowledgement
This work was supported by JSPS KAKENHI Grant Number JP26280080.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Kikuchi, Y., Watanabe, A., Sasano, R., Takamura, H., Okumura, M. (2016). Learning from Numerous Untailored Summaries. In: Booth, R., Zhang, ML. (eds) PRICAI 2016: Trends in Artificial Intelligence. PRICAI 2016. Lecture Notes in Computer Science, vol 9810. Springer, Cham. https://doi.org/10.1007/978-3-319-42911-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42910-6
Online ISBN: 978-3-319-42911-3