Mining for Information Discovery on the Web: Overview and Illustrative Research

  • Hwanjo Yu
  • AnHai Doan
  • Jiawei Han
Chapter

Abstract

The Web has become a fertile ground for numerous research activities in mining. In this chapter, we discuss research on finding targeted information on the Web. First, we briefly survey the research area. We focus in particular on two key issues: (a) mining to impose structures over Web data, by building taxonomies and portals for example, to aid in Web navigation, and (b) mining to build information processing systems, such as search engines, question answering systems, and data integration systems. Next, we describe two recent Web mining projects that illustrate the use of mining techniques to address the above two key issues. We conclude by briefly discussing novel research opportunities in the area of mining for information discovery on the Web.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Hwanjo Yu (1)
  • AnHai Doan (1)
  • Jiawei Han (1)
  1. Department of Computer Science, Thomas M. Siebel Center for Computer Science, USA
