Similarity Measures Based on Latent Dirichlet Allocation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

We present the results of our investigation of semantic similarity measures at the word and sentence levels, based on two fully automated approaches to deriving meaning from large corpora: Latent Dirichlet Allocation, a probabilistic approach, and Latent Semantic Analysis, an algebraic approach. The focus is on similarity measures based on Latent Dirichlet Allocation because of their novelty, while the Latent Semantic Analysis measures are used for comparison. We explore two types of measures based on Latent Dirichlet Allocation: measures based on distances between probability distributions, which can be applied directly to larger texts such as sentences, and a word-to-word similarity measure that is then extended to work at the sentence level. We present results on paraphrase identification using the Microsoft Research Paraphrase corpus.
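As a rough illustration of the two kinds of measures the abstract describes, the sketch below shows (1) a similarity derived from a distance between two topic distributions and (2) one way to lift a word-to-word similarity to the sentence level. The abstract does not say which distances or which expansion scheme the paper uses, so the Hellinger distance, the greedy best-match averaging, and all function names here are assumptions chosen for illustration, not the authors' method.

```python
import math


def hellinger_similarity(p, q):
    """Similarity in [0, 1] between two discrete distributions.

    Assumption: uses 1 - Hellinger distance; the paper may use other distances.
    p and q are sequences of probabilities over the same topics.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Bhattacharyya coefficient: 1.0 for identical distributions.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to guard against tiny negative values from rounding.
    distance = math.sqrt(max(0.0, 1.0 - bc))
    return 1.0 - distance


def greedy_sentence_similarity(words_a, words_b, word_sim):
    """Lift a word-to-word similarity to sentence level (hypothetical scheme).

    Each word is paired with its best match in the other sentence; the
    per-direction averages are then averaged to make the score symmetric.
    """
    if not words_a or not words_b:
        return 0.0

    def directed(src, dst):
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))


def exact_match(w, v):
    """Toy word similarity used only to exercise the sentence-level function."""
    return 1.0 if w == v else 0.0
```

In practice, `p` and `q` would be the topic distributions inferred for two sentences by a trained Latent Dirichlet Allocation model, and `word_sim` would be a similarity computed from the words' topic weights rather than the toy `exact_match` function.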

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rus, V., Niraula, N., Banjade, R. (2013). Similarity Measures Based on Latent Dirichlet Allocation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_37

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science, Computer Science (R0)
