Optimization of Cross-Lingual LSI Training Data

  • John Pozniak
  • Roger Bradford
Conference paper
Part of the Studies in Computational Intelligence book series (SCI, volume 614)


The technique of latent semantic indexing (LSI) is widely employed to provide information retrieval, categorization, clustering, and discovery capabilities. In these applications, the key feature of the technique is its ability to compare objects (such as documents and queries) based on the semantics of their constituents. These comparisons are carried out in a high-dimensional vector space generated from an analysis of feature occurrences in the items of a training set. The LSI literature repeatedly advises selecting training items that are similar in content to the items to be processed in the application. This paper presents a principled approach for making such selections. We present test results for the technique on cross-lingual document similarity comparison. The results demonstrate that, at least for this use case, the technique can have a dramatic beneficial effect on LSI performance.
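As background to the abstract: LSI generates its vector space by applying a truncated singular value decomposition (SVD) to a term-by-document occurrence matrix built from the training set, and then compares objects by cosine similarity in the reduced space. The following is a minimal sketch of that mechanism; the toy matrix, the vocabulary, and the choice of k = 2 retained dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# In a real application this would be built from a large training corpus.
A = np.array([
    [2, 0, 1, 0],   # "ship"
    [1, 0, 0, 0],   # "boat"
    [0, 1, 0, 1],   # "tree"
    [0, 2, 0, 1],   # "forest"
    [1, 0, 2, 0],   # "ocean"
], dtype=float)

# LSI: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # retained latent dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in k-dim LSI space

def cosine(u, v):
    """Cosine similarity between two LSI vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 2 share nautical terms, so in the latent space they
# should be far more similar to each other than to forest-themed document 1.
print(cosine(doc_vectors[0], doc_vectors[2]))
print(cosine(doc_vectors[0], doc_vectors[1]))
```

The paper's contribution concerns which items go into the matrix A in the first place: because the latent dimensions are derived entirely from the training set, selecting training items whose content matches the application items directly shapes the quality of these similarity comparisons.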


Keywords: Latent semantic indexing · LSI · LSI training · LSI optimization · Training set optimization · Cross-lingual LSI



© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Leidos, Chantilly, USA
  2. Maxim Analytics, Reston, USA
