A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution

Multi-source, Multilingual Information Extraction and Summarization

Abstract

Finding information about people in large text collections or online repositories on the Web is a common activity. We describe experiments that aim to identify the contribution of semantic information (e.g., named entities) and summarization (e.g., sentence extracts) to a cross-document coreference resolution system. Our system uses a clustering-based algorithm to group documents that refer to the same entity. Clustering operates on vector representations created by summarization and semantic tagging components. We investigate different clustering configurations and show that the choice of summary type and of the type of term used in the vector representation is important for achieving good performance.
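The general idea of grouping documents by the similarity of their term vectors can be illustrated with a minimal sketch. This is not the chapter's system: the single-link merging rule, the raw term-frequency weighting, the threshold value, and the names `term_vector`, `cosine`, and `cluster_documents` are all assumptions made for illustration. Each document is assumed to be already reduced to a list of terms (for example, named entities taken from a summary).

```python
import math
from collections import Counter


def term_vector(terms):
    """Build a raw term-frequency vector from a list of terms (hypothetical helper)."""
    return Counter(terms)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_documents(docs, threshold=0.5):
    """Agglomerative clustering with single-link similarity (an assumed choice).

    Starts with one cluster per document and repeatedly merges the most
    similar pair of clusters until no pair reaches the threshold.
    Returns clusters as sorted lists of document indices.
    """
    vectors = [term_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy input: four documents mentioning the same name, two distinct people.
docs = [
    ["john_smith", "chicago", "senator", "election"],
    ["john_smith", "senator", "election", "campaign"],
    ["john_smith", "guitar", "album", "tour"],
    ["john_smith", "album", "tour", "band"],
]
clusters = cluster_documents(docs, threshold=0.5)
```

In this toy setting the shared name alone is not enough to merge documents; the outcome depends on which other terms enter the vectors, which is the kind of representation choice the experiments examine.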

Acknowledgements

We thank the reviewers for their comments and suggestions, which helped improve the final version of this paper. Horacio Saggion is grateful for a fellowship from Programa Ramón y Cajal, Ministerio de Ciencia e Innovación, Spain. We acknowledge the support of the editors of this volume.

Author information

Corresponding author

Correspondence to Horacio Saggion.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Saggion, H. (2013). A Study of the Effect of Document Representations in Clustering-Based Cross-Document Coreference Resolution. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_6

  • DOI: https://doi.org/10.1007/978-3-642-28569-1_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28568-4

  • Online ISBN: 978-3-642-28569-1

  • eBook Packages: Computer Science, Computer Science (R0)
