Web Mining: Extracting Knowledge from the World Wide Web

  • Zhongzhi Shi
  • Huifang Ma
  • Qing He

This chapter surveys existing techniques for Web mining, which aim to make the World Wide Web a more useful environment in which users can quickly and easily find the information they need. In particular, it introduces methods for data mining on the Web developed by our laboratory, covering the discovery of patterns in Web content (semantic processing, classification, clustering), Web structure (retrieval, classical link analysis methods), and Web events (preprocessing for Web event mining, dynamic news tracing, multi-document summarization). The chapter is intended as a resource for students and researchers who are familiar with the basic principles of data mining and want to learn how to apply them to problems in Web mining.


Keywords: World Wide, Swarm Intelligence, Term Frequency, Vector Space Model, Concept Space





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, People's Republic of China
