Intelligent Internet Information Systems in Knowledge Acquisition: Techniques and Applications

Lin, Shian-Hua

doi:10.1007/978-1-4020-7829-3_5

Abstract

The explosive growth of the World Wide Web continues to revolutionize how information is edited, published, and accessed. Within the Web infrastructure, individuals can easily edit and publish documents that contain hyperlinks to other documents published by the same or other Web sites. As a result, the Web contains information on almost any subject, available anywhere to anyone at any time. However, this explosive information growth has made finding information like looking for a needle in a haystack. Although directory services (like Yahoo!¹) and search engines (like Google²) facilitate information searches, many users still have difficulty locating useful information. Browsing directories is time-consuming because there is a seemingly infinite number of possible topics. For example, Open Directory (currently the largest directory database) contains over 460,000 categories³. Users must click repeatedly to find a target directory and browse its documents. Furthermore, constructing directories is labor-intensive, and directory services cannot keep up with Web growth. Finding documents using search engines is frustrating because search results usually contain thousands of links. Although some search engines, like Google, apply hyperlink analysis to provide better ranking, the results are still often ineffective.

¹ http://www.yahoo.com/.

² http://www.google.com/.

³ http://dmoz.org/. The Web site contained over 3.8 million sites, 57,238 editors, and over 460,000 categories when I visited it on June 26, 2003.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan, Republic of China
Shian-Hua Lin

Editor information

Editors and Affiliations

University of California, Los Angeles, USA
Cornelius T. Leondes

About this chapter

Cite this chapter

Lin, SH. (2005). Intelligent Internet Information Systems in Knowledge Acquisition: Techniques and Applications. In: Leondes, C.T. (eds) Intelligent Knowledge-Based Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4020-7829-3_5

DOI: https://doi.org/10.1007/978-1-4020-7829-3_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7746-3
Online ISBN: 978-1-4020-7829-3
eBook Packages: Computer Science, Computer Science (R0)