A General Method of Mining Chinese Web Documents Based on GA&SA and Position-Factors

Bai, Xi; Sun, Jigui; Che, Haiyan; Wang, Jin

doi:10.1007/978-3-540-77018-3_41



Conference paper


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4819)

Abstract

Clustering and classification are two important techniques for mining Web information. In this paper, a new adaptive method for mining Chinese documents from the Internet is proposed. First, we give a document clustering algorithm, based on the Boolean model, that combines a Genetic Algorithm (GA) with Simulated Annealing (SA). This algorithm avoids the main disadvantage of clustering documents with a pure GA, which converges prematurely and gets trapped in a local optimum, making the clustering inaccurate. Then, considering that classification with the traditional Vector Space Model (VSM) is not satisfactory because it ignores how important individual words are, we add the position-factors of keywords to the VSM and build a new classifier model for Chinese Web documents. Experimental results indicate that this adaptive method makes clustering and classification more accurate and reasonable than methods that do not take word positions into account.
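The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the position names, the weights in POSITION_WEIGHTS, and the helper names weighted_vector, cosine, and sa_accept are all hypothetical, and the abstract does not state the actual position-factors or the annealing schedule. The first part builds a term vector in which a word counts more when it appears in a more prominent position; the second part shows the standard Metropolis acceptance rule that simulated annealing could contribute to a GA, letting a worse candidate survive occasionally so the search can leave a local optimum.

```python
import math
from collections import Counter

# Hypothetical position-factors; the abstract does not give the real values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term vector from a document given as
    {position name: list of already segmented terms}.
    Each occurrence adds the weight of the position it appears in."""
    vec = Counter()
    for position, terms in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)  # unknown positions count as body text
        for term in terms:
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sa_accept(old_fitness, new_fitness, temperature, u):
    """Metropolis rule for a fitness that is being maximised.
    u is a uniform random draw in [0, 1), passed in so the rule is testable.
    Better candidates are always kept; worse ones are kept with
    probability exp((new - old) / temperature)."""
    if new_fitness >= old_fitness:
        return True
    return u < math.exp((new_fitness - old_fitness) / temperature)


# Usage: a word in the title outweighs the same word in the body.
doc = {"title": ["足球"], "body": ["足球", "比赛"]}
vec = weighted_vector(doc)
```

In a classifier, each category would presumably be represented by a vector built the same way, and a new document assigned to the category with the highest cosine score. In a hybrid clustering loop, sa_accept would decide whether an offspring replaces its parent, with the temperature lowered over the generations.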



Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Changchun 130012, China
Xi Bai, Jigui Sun & Haiyan Che
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
Xi Bai, Jigui Sun & Haiyan Che
Institute of Network and Information Security, Shandong University, Jinan 250100, China
Jin Wang


Editor information

Takashi Washio, Zhi-Hua Zhou, Joshua Zhexue Huang, Xiaohua Hu, Jinyan Li, Chao Xie, Jieyue He, Deqing Zou, Kuan-Ching Li, Mário M. Freire


Cite this paper

Bai, X., Sun, J., Che, H., Wang, J. (2007). A General Method of Mining Chinese Web Documents Based on GA&SA and Position-Factors. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_41


DOI: https://doi.org/10.1007/978-3-540-77018-3_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science, Computer Science (R0)
