Identification of Similar Documents Using Coherent Chunks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5847)

Abstract

We focus on automatically finding similar documents using coherent chunks. The similarity between documents is determined by identifying the coherent chunks present in them. We apply linguistic rules to identify the coherent chunks and use the Vector Space Model (VSM) to determine the similarity among documents. We use patent documents from the USPTO for this work. This method of using coherent chunks to identify similar documents has shown encouraging results.
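
As a rough illustration of the VSM step only, the sketch below represents each document as a vector of chunk counts and compares two documents by the cosine of the angle between their vectors. It is a minimal sketch under stated assumptions, not the method used in the paper: the chunk lists are hypothetical stand-ins for the output of the linguistic rules, raw counts are an assumed weighting, and the names `chunk_vector` and `cosine_similarity` are introduced here for illustration.

```python
import math
from collections import Counter


def chunk_vector(chunks):
    """Build a sparse vector: one dimension per distinct chunk, valued by its count."""
    return Counter(chunk.lower() for chunk in chunks)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical chunk lists; in practice these would come from the chunking rules.
doc_a = chunk_vector(["fuel injector", "combustion chamber", "fuel injector"])
doc_b = chunk_vector(["fuel injector", "exhaust valve"])
doc_c = chunk_vector(["touch screen", "display driver"])

sim_ab = cosine_similarity(doc_a, doc_b)  # documents share one chunk
sim_ac = cosine_similarity(doc_a, doc_c)  # documents share no chunks
```

Documents that share chunks score above zero, and documents with no chunk in common score exactly zero. A weighting such as tf-idf could replace the raw counts without changing the comparison step.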

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lalitha Devi, S., Kuppan, S., Venkataswamy, K., Rao, P.R.K. (2009). Identification of Similar Documents Using Coherent Chunks. In: Lalitha Devi, S., Branco, A., Mitkov, R. (eds) Anaphora Processing and Applications. DAARC 2009. Lecture Notes in Computer Science, vol 5847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04975-0_5

  • DOI: https://doi.org/10.1007/978-3-642-04975-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04974-3

  • Online ISBN: 978-3-642-04975-0

  • eBook Packages: Computer Science, Computer Science (R0)
