A General Method of Mining Chinese Web Documents Based on GA&SA and Position-Factors

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4819)

Abstract

Clustering and classification are two important techniques for mining Web information. In this paper, we propose a new adaptive method for mining Chinese documents from the Internet. First, we give a document-clustering algorithm that combines a Genetic Algorithm (GA) with Simulated Annealing (SA) on the Boolean model. This algorithm avoids the main drawback of clustering documents with a pure GA, which converges prematurely and becomes trapped in a local optimum, so its results are inaccurate. Second, because classification with the traditional Vector Space Model (VSM) is unsatisfactory, as it ignores how important each word is, we add the position factors of keywords to the VSM and build a new classifier model for Chinese Web documents. Experimental results indicate that this adaptive method makes clustering and classification more accurate and reasonable than methods that do not take word positions into account.
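To make the position-factor idea concrete, the sketch below weights each term by where it appears in a document before comparing vectors by cosine similarity. It is only an illustration under stated assumptions: the abstract gives neither the weight values nor the classifier's decision rule, so the `POSITION_WEIGHTS` table, the position names, the nearest-centroid rule, and all function names here are hypothetical. Documents are assumed to be already segmented into terms.

```python
import math
from collections import Counter

# Hypothetical position factors; the abstract does not state actual values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term vector from a dict mapping position name -> list of terms.

    Each occurrence of a term adds the weight of the position it occurs in,
    so a term in the title counts more than the same term in the body.
    """
    vec = Counter()
    for position, terms in doc.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)  # unknown positions count as body
        for term in terms:
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc, centroids):
    """Return the label whose centroid is most similar to the document.

    Nearest-centroid is an assumed decision rule, used only for illustration.
    """
    vec = weighted_vector(doc)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

With these assumed weights, a document whose title contains a sports term once and whose body contains a finance term twice is assigned to the sports class, whereas plain term counts would favour finance.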

Editor information

Takashi Washio, Zhi-Hua Zhou, Joshua Zhexue Huang, Xiaohua Hu, Jinyan Li, Chao Xie, Jieyue He, Deqing Zou, Kuan-Ching Li, Mário M. Freire

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bai, X., Sun, J., Che, H., Wang, J. (2007). A General Method of Mining Chinese Web Documents Based on GA&SA and Position-Factors. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_41

  • DOI: https://doi.org/10.1007/978-3-540-77018-3_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77016-9

  • Online ISBN: 978-3-540-77018-3

  • eBook Packages: Computer Science, Computer Science (R0)
