A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution

Saggion, Horacio

doi:10.1007/978-3-642-28569-1_6

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Finding information about people in huge text collections or online repositories on the Web is a common activity. We describe experiments aimed at identifying the contribution of semantic information (e.g., named entities) and summarization (e.g., sentence extracts) to a cross-document coreference resolution system. Our system uses a clustering-based algorithm to group documents referring to the same entity. Clustering uses vector representations created by summarization and semantic tagging components. We investigate different clustering configurations and show that the choice of summary type and of the type of term used for the vector representation is important for achieving good performance.
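
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the abstract does not say which clustering algorithm, similarity measure, or threshold the system uses, so the single-link agglomerative strategy, the cosine measure, the `threshold` value, and the toy documents below are all assumptions made for illustration. Each document is reduced to a bag of terms (for example, named entities taken from a summary), and documents whose term vectors are similar enough are grouped as referring to the same person.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs: Dict[str, List[str]], threshold: float = 0.3) -> List[Set[str]]:
    """Single-link agglomerative clustering (an assumed, illustrative choice).

    Repeatedly merges the two most similar clusters until no pair reaches
    the similarity threshold.
    """
    vectors = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    clusters = [{doc_id} for doc_id in docs]
    while True:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    cosine(vectors[x], vectors[y])
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if sim >= best:
                    best, best_pair = sim, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]


# Hypothetical documents mentioning the same name, represented by extracted terms.
docs = {
    "d1": ["John Smith", "Sheffield", "University", "professor"],
    "d2": ["John Smith", "University", "Sheffield", "lecture"],
    "d3": ["John Smith", "Chicago", "Bulls", "basketball"],
}

clusters = cluster(docs, threshold=0.3)
print([sorted(c) for c in clusters])  # [['d1', 'd2'], ['d3']]
```

Changing what goes into each term list (all words of the full text, words of a summary only, or named entities only) changes the vectors and therefore the clusters, which is the kind of effect the experiments investigate.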

Acknowledgements

We thank the reviewers for their comments and suggestions, which helped improve the final version of this paper. Horacio Saggion is grateful for a fellowship from Programa Ramón y Cajal, Ministerio de Ciencia e Innovación, Spain. We acknowledge the support of the editors of this volume.

Author information

Authors and Affiliations

Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain
Horacio Saggion

Corresponding author

Correspondence to Horacio Saggion.

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Academy of Science, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

About this chapter

Cite this chapter

Saggion, H. (2013). A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_6

DOI: https://doi.org/10.1007/978-3-642-28569-1_6
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)