Abstract
Multilingual corpora are becoming an essential resource for multilingual natural language processing. This paper investigates the effects of applying a clustering technique to parallel multilingual texts, comparing the cluster mappings and the tree structures of the clusters produced for each language. It also studies the effect of reducing the set of terms considered when clustering parallel corpora. A genetic-based algorithm is then applied to optimize the weights of the terms used in clustering, so that unseen documents can be classified. Specifically, the paper introduces the tools necessary for this task and presents a set of experimental results together with the issues that became apparent.
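As a rough illustration of the comparison described in the abstract, the sketch below clusters the two sides of a tiny parallel corpus separately and measures how well the resulting cluster mappings agree. It is not the paper's implementation: the toy English/French documents, the TF-IDF weighting, the average-linkage merging rule, the Rand index as agreement measure, and all function names are assumptions made for this example. Term reduction and the genetic weight optimization are not covered.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one sparse TF-IDF dict per tokenised document (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def hac(vectors, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(cosine(vectors[i], vectors[j])
                            for i in clusters[a] for j in clusters[b])
                sim = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

def rand_index(c1, c2, n):
    """Fraction of document pairs on which two clusterings agree."""
    def labels(clusters):
        return {doc: cid for cid, members in enumerate(clusters) for doc in members}
    l1, l2 = labels(c1), labels(c2)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical parallel corpus: document i in `en` is a translation of document i in `fr`.
en = [["bank", "loan", "interest"], ["bank", "credit", "loan"],
      ["football", "match", "goal"], ["football", "goal", "team"]]
fr = [["banque", "pret", "interet"], ["banque", "credit", "pret"],
      ["football", "match", "but"], ["football", "but", "equipe"]]

en_clusters = hac(tfidf(en), k=2)
fr_clusters = hac(tfidf(fr), k=2)
agreement = rand_index(en_clusters, fr_clusters, len(en))
print(en_clusters, fr_clusters, agreement)
```

On real parallel corpora the two clusterings would generally not coincide, and the disagreement between them is the kind of difference in cluster mappings the abstract refers to.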
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alfred, R. (2009). A Parallel Hierarchical Agglomerative Clustering Technique for Billingual Corpora Based on Reduced Terms with Automatic Weight Optimization. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_6
DOI: https://doi.org/10.1007/978-3-642-03348-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science (R0)