Using KCCA for Japanese–English cross-language information retrieval and document classification
Kernel Canonical Correlation Analysis (KCCA) is a method for finding linear relationships between two sets of variables in a kernel-defined feature space. We study a machine learning algorithm based on KCCA for cross-language information retrieval and apply it to Japanese–English cross-language information retrieval. The results are encouraging and are significantly better than those obtained by other state-of-the-art methods. Computational complexity is an important issue when applying KCCA to large datasets, as in information retrieval, and we experimentally evaluate several methods for alleviating this problem. We also investigate cross-language document classification using KCCA as well as other methods. Our results show that it is feasible to use a classifier learned in one language to classify documents in other languages.
Keywords: Cross-language information retrieval; Machine learning; Kernel canonical correlation analysis; Unsupervised learning; Cross-language Japanese–English document retrieval and classification
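For orientation, the idea of KCCA summarized in the abstract can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the regularized eigenproblem form, the use of linear kernels, the regularization value `kappa`, the synthetic paired data, and the function names `center_kernel` and `kcca` are all choices made for this example only.

```python
import numpy as np


def center_kernel(K):
    """Center a kernel matrix in feature space: K <- H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kcca(Kx, Ky, kappa=0.1, n_components=2):
    """Regularized KCCA sketch (assumed formulation).

    Solves (Kx + kappa I)^-1 Ky (Ky + kappa I)^-1 Kx alpha = lambda^2 alpha
    and recovers beta = (Ky + kappa I)^-1 Kx alpha / lambda.
    Returns dual weights alpha, beta (n x n_components) and the values lambda.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + kappa * I, Ky) @ np.linalg.solve(Ky + kappa * I, Kx)
    eigvals, eigvecs = np.linalg.eig(M)
    # The eigenvalues are real and non-negative in exact arithmetic;
    # drop any numerical imaginary residue before sorting.
    order = np.argsort(-eigvals.real)[:n_components]
    lam = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
    alpha = eigvecs[:, order].real
    # Guard against division by zero for degenerate components.
    beta = np.linalg.solve(Ky + kappa * I, Kx @ alpha) / np.maximum(lam, 1e-12)
    return alpha, beta, lam


if __name__ == "__main__":
    # Hypothetical stand-in for paired documents: two "views" of a shared
    # latent signal play the role of Japanese and English feature vectors.
    rng = np.random.default_rng(0)
    z = rng.standard_normal((40, 2))
    X = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((40, 5))
    Y = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((40, 6))
    Kx = center_kernel(X @ X.T)  # linear kernel on view 1
    Ky = center_kernel(Y @ Y.T)  # linear kernel on view 2
    alpha, beta, lam = kcca(Kx, Ky)
    corr = np.corrcoef(Kx @ alpha[:, 0], Ky @ beta[:, 0])[0, 1]
    print("correlation of first projected pair:", corr)
```

In a retrieval setting, one would typically project queries and documents from both languages into this shared space and rank documents by similarity there. The sketch forms and decomposes a dense n × n matrix, which hints at why the abstract singles out computational complexity as an issue for large collections.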