An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

Kim, Yu-Seop; Chang, Jeong-Ho; Zhang, Byoung-Tak

doi:10.1007/3-540-36175-8_11

Conference paper
First Online: 01 January 2003

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Abstract

In this paper, we empirically search for the optimal dimensionality of two data-driven models, the Latent Semantic Analysis (LSA) model and the Probabilistic Latent Semantic Analysis (PLSA) model. These models are used to build linguistic semantic knowledge, which can then be used to estimate contextual semantic similarity for target word selection in English-Korean machine translation. We also employ the k-Nearest Neighbor (k-NN) learning algorithm. We diversify our experiments by analyzing the covariance between the value of k in k-NN learning and the selection accuracy, in addition to the covariance between the dimensionality and the accuracy. Although we could not find a regular relationship between the dimensionality and the accuracy, we could identify the optimal dimensionality, the one giving the soundest distribution of data in our experiments.
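The kind of experiment the abstract describes, sweeping both the latent dimensionality and the value of k and recording selection accuracy, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy vectors, the labels "A" and "B", and the function names `cosine`, `knn_select`, `accuracy`, and `sweep` are all assumptions made here. It further assumes that each word already has a latent-space vector whose coordinates are ordered by importance (as in LSA, where keeping the first d coordinates corresponds to a rank-d model), and that contextual similarity is measured by cosine similarity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_select(query, examples, k, dim):
    """Majority vote among the k examples most similar to `query`,
    comparing only the first `dim` latent coordinates."""
    q = query[:dim]
    ranked = sorted(examples, key=lambda ex: cosine(q, ex[0][:dim]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k, dim):
    """Fraction of test items whose selected label matches the gold label."""
    hits = sum(knn_select(vec, train, k, dim) == label for vec, label in test)
    return hits / len(test)


def sweep(train, test, ks, dims):
    """Accuracy for every (k, dimensionality) pair."""
    return {(k, d): accuracy(train, test, k, d) for k in ks for d in dims}


# Hypothetical toy data: (latent vector, target-word label). Not from the paper.
TRAIN = [
    ([0.9, 0.1, 0.3], "A"),
    ([0.8, 0.2, -0.4], "A"),
    ([0.1, 0.9, 0.5], "B"),
    ([0.2, 0.8, -0.2], "B"),
]
TEST = [([0.7, 0.3, 0.0], "A"), ([0.3, 0.7, 0.1], "B")]

if __name__ == "__main__":
    results = sweep(TRAIN, TEST, ks=(1, 3), dims=(2, 3))
    for (k, d), acc in sorted(results.items()):
        print(f"k={k} dim={d} accuracy={acc:.2f}")
```

With real data, the resulting table of accuracies could then be inspected for any relationship between k, dimensionality, and accuracy, which is the analysis the abstract refers to.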

This work was supported by the Korea Ministry of Science and Technology under the BrainTech Project.

Author information

Authors and Affiliations

Division of Information and Telecommunication Engineering, Hallym University, Kang-Won, Korea, 200-702
Yu-Seop Kim & Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University, Seoul, Korea, 151-744
Jeong-Ho Chang

Editor information

Editors and Affiliations

Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
Kyu-Young Whang
Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
Jongwoo Jeon
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim
Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
Jaideep Srivastava

Cite this paper

Kim, YS., Chang, JH., Zhang, BT. (2003). An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_11

DOI: https://doi.org/10.1007/3-540-36175-8_11
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive
