Journal of Intelligent Information Systems, Volume 27, Issue 2, pp 117–133

Using KCCA for Japanese–English cross-language information retrieval and document classification

  • Yaoyong Li
  • John Shawe-Taylor

Abstract

Kernel Canonical Correlation Analysis (KCCA) is a method of correlating linear relationships between two variables in a kernel-defined feature space. We study a machine learning algorithm based on KCCA for cross-language information retrieval and apply it to Japanese–English cross-language information retrieval. The results are quite encouraging and are significantly better than those obtained by other state-of-the-art methods. Computational complexity is an important issue when applying KCCA to large datasets, as in information retrieval, and we experimentally evaluate several methods to alleviate this problem. We also investigate cross-language document classification using KCCA as well as other methods. Our results show that it is feasible to use a classifier learned in one language to classify documents in other languages.
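
As an illustration of the idea summarised above, the following is a minimal sketch of regularised KCCA with linear kernels, used to rank documents in one language against a query in the other. It is not the implementation evaluated in the article. The function names (`linear_kernel`, `fit_kcca`, `rank_documents`), the regularisation value `reg`, and the choice of a linear kernel are all assumptions made for the example. The eigenproblem used is one commonly cited regularised formulation, (Kx + reg·I)⁻¹ Ky (Ky + reg·I)⁻¹ Kx α = ρ² α.

```python
import numpy as np


def linear_kernel(A, B):
    """Linear kernel between the rows of A and the rows of B."""
    return A @ B.T


def fit_kcca(Kx, Ky, reg=1.0, n_components=3):
    """Regularised KCCA on two n x n kernel matrices of paired documents.

    Solves (Kx + reg I)^-1 Ky (Ky + reg I)^-1 Kx alpha = rho^2 alpha,
    an assumed formulation chosen for this sketch.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Ax = Kx + reg * I
    Ay = Ky + reg * I
    M = np.linalg.solve(Ax, Ky) @ np.linalg.solve(Ay, Kx)
    vals, vecs = np.linalg.eig(M)
    # Keep the directions with the largest correlations.
    order = np.argsort(-vals.real)[:n_components]
    rho = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs[:, order].real
    # Dual weights for the second language follow from alpha.
    # Assumes the retained correlations rho are non-zero.
    beta = np.linalg.solve(Ay, Kx @ alpha) / rho
    return alpha, beta, rho


def rank_documents(query_x, docs_y, X_train, Y_train, alpha, beta):
    """Rank documents in language y against a query in language x.

    Both sides are projected into the shared KCCA space and compared
    by cosine similarity; indices of docs_y are returned best first.
    """
    q = linear_kernel(query_x[None, :], X_train) @ alpha  # shape (1, k)
    D = linear_kernel(docs_y, Y_train) @ beta             # shape (m, k)
    q = q / np.linalg.norm(q)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    scores = (D @ q.T).ravel()
    return np.argsort(-scores)
```

Here `X_train` and `Y_train` would hold term vectors for the two sides of a paired (for example Japanese–English) training corpus, one row per document pair. Forming and decomposing the n × n kernel matrices is where the cost grows with corpus size, which is the computational issue the abstract refers to.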

Keywords

Cross-language information retrieval; Machine learning; Kernel canonical correlation analysis; Unsupervised learning; Cross-language Japanese–English document retrieval and classification

Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Computer Science, The University of Sheffield, Sheffield, UK
  2. ISIS Group, School of Electronics and Computer Science, University of Southampton, Southampton, UK
