Chinese Document Clustering Using Self-Organizing Map-Based on Botanical Document Warehouse

  • Howard Lo
  • Che-Chern Lin
  • Rong-Jyue Fang
  • Chungping Lee
  • Yu-Chen Weng
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 28)


The exponential growth of information has left users awash in data and made it difficult to find what they need. An efficient method for organizing information and assisting users' navigation is therefore particularly important. In this paper, we apply the Self-Organizing Map (SOM) algorithm to cluster Chinese botanical documents onto a two-dimensional map. Each botanical document is converted into plain text and treated as a bag of words. We use term frequency and inverse document frequency (TF-IDF) to extract key terms from the documents as the input to the SOM. In total, 892 Chinese botanical documents were projected onto a 2D map to assist users' navigation. In our experiments, recall ranged from 0.71 for Polygonaceae documents to 0.94 for Amaranthaceae documents, and precision ranged from 0.81 for Umbelliferae documents to 1.00 for Convolvulaceae and Cruciferae documents.
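
The pipeline summarized in the abstract (bag-of-words documents, term weighting, projection onto a 2D SOM grid) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy pre-segmented English tokens stand in for segmented Chinese terms, the weighting is standard TF-IDF with L2 normalization, and the grid size, learning rate, radius, and epoch count in the hypothetical `train_som` helper are arbitrary.

```python
import math
import random

def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        vec = [(doc.count(t) / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard all-zero vectors
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def best_matching_unit(weights, vec):
    """Return the grid coordinate whose weight vector is closest to vec."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], vec)),
    )

def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; all hyperparameters are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 toward 0
        lr = lr0 * frac                      # shrinking learning rate
        radius = max(radius0 * frac, 0.5)    # shrinking neighborhood radius
        for vec in vectors:
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                # Pull this node's weights toward the input vector.
                for i in range(dim):
                    w[i] += lr * influence * (vec[i] - w[i])
    return weights

# Toy corpus: hypothetical, already-segmented documents.
docs = [
    ["leaf", "stem", "purple", "flower"],
    ["leaf", "stem", "purple", "seed"],
    ["root", "vine", "white", "trumpet"],
    ["root", "vine", "white", "twining"],
]
vocab, vectors = tfidf_vectors(docs)
weights = train_som(vectors)
positions = [best_matching_unit(weights, v) for v in vectors]
print(positions)  # one (row, col) grid cell per document
```

Documents that share many weighted terms tend to land on the same or neighboring grid cells, which is what makes the map usable for browsing. Per-family recall and precision of the kind reported in the abstract would then be computed by comparing each document's cell assignment against its known plant family.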


Keywords: Recall rate, Vector space model, Precision rate, Unstructured data, Chinese term
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Howard Lo (1)
  • Che-Chern Lin (1)
  • Rong-Jyue Fang (2)
  • Chungping Lee (1)
  • Yu-Chen Weng (1)
  1. Department of Industrial Technology Education, National Kaohsiung Normal University, Kaohsiung City, Taiwan
  2. Department of Information Management, National Kaohsiung Normal University, Kaohsiung City, Taiwan
