Abstract
The explosive growth of the World Wide Web continues to revolutionize how information is edited, published, and accessed. Within the Web infrastructure, individuals can easily edit and publish documents that contain hyperlinks to other documents published by the same or other Web sites. As a result, the Web offers information on almost any subject, available anywhere, to anyone, at any time. However, this explosive growth has made finding information like looking for a needle in a haystack. Although directory services (like Yahoo!¹) and search engines (like Google²) facilitate information searches, many users still have difficulty locating useful information. Browsing directories is time-consuming because the number of possible topics is seemingly infinite. For example, Open Directory (currently the largest directory database) contains over 460,000 categories³. Users must click through many levels to find a target directory and browse its documents. Furthermore, constructing directories is labor-intensive, and directory services cannot keep up with Web growth. Finding documents using search engines is frustrating because search results usually contain thousands of links. Although some search engines, like Google, apply hyperlink analysis to provide better ranking, the ranking is still often ineffective.
1. http://www.yahoo.com/.
2. http://www.google.com/.
3. http://dmoz.org/. The Web site listed over 3.8 million sites, 57,238 editors, and over 460,000 categories when I visited it on June 26, 2003.
Copyright information
© 2005 Kluwer Academic Publishers
About this chapter
Cite this chapter
Lin, SH. (2005). Intelligent Internet Information Systems in Knowledge Acquisition: Techniques and Applications. In: Leondes, C.T. (eds) Intelligent Knowledge-Based Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4020-7829-3_5
DOI: https://doi.org/10.1007/978-1-4020-7829-3_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7746-3
Online ISBN: 978-1-4020-7829-3
eBook Packages: Computer Science, Computer Science (R0)