Abstract
We address the automatic identification of similar documents using coherent chunks. The similarity between documents is determined by identifying the coherent chunks they contain. We apply linguistic rules to identify the coherent chunks and use the Vector Space Model (VSM) to determine the similarity among documents. The work uses patent documents from the USPTO. Using coherent chunks to identify similar documents has shown encouraging results.
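As a rough illustration of the VSM step mentioned above, the following minimal sketch compares documents by the cosine of their chunk vectors. It makes several assumptions that are not taken from the paper: the coherent chunks are already extracted (the linguistic rules are not modelled here), each chunk is treated as one vector dimension, and raw chunk frequency is used as the weight. The names `chunk_vector` and `cosine_similarity` and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def chunk_vector(chunks):
    """Build a frequency vector with one dimension per distinct chunk.

    Assumption: raw counts are used as weights; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    return Counter(chunk.lower() for chunk in chunks)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse chunk vectors."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[term] * vec_b[term] for term in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written chunks standing in for extracted ones.
doc_a = chunk_vector(["fuel injection valve", "combustion chamber",
                      "fuel injection valve"])
doc_b = chunk_vector(["fuel injection valve", "exhaust manifold"])
doc_c = chunk_vector(["wireless antenna", "signal amplifier"])

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Under these assumptions, documents that share chunks (`doc_a` and `doc_b`) score higher than documents that share none (`doc_a` and `doc_c`), which is the ranking behaviour a VSM-based similarity measure relies on.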
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lalitha Devi, S., Kuppan, S., Venkataswamy, K., Rao, P.R.K. (2009). Identification of Similar Documents Using Coherent Chunks. In: Lalitha Devi, S., Branco, A., Mitkov, R. (eds) Anaphora Processing and Applications. DAARC 2009. Lecture Notes in Computer Science, vol 5847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04975-0_5
DOI: https://doi.org/10.1007/978-3-642-04975-0_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04974-3
Online ISBN: 978-3-642-04975-0
eBook Packages: Computer Science, Computer Science (R0)