Abstract
Many text databases on the web are “hidden” behind search interfaces, and their documents are accessible only through querying. Traditional search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier’s rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report initial exploratory experiments showing that our approach is promising for automatically characterizing the contents of text databases accessible on the web.
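The probing idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the rule format (a set of words implying a category), the example rules and documents, the `AND` query syntax, the helper names `rule_to_query`, `classify_database`, and `make_toy_database`, and in particular the `min_share` threshold used to turn match counts into category labels. A real search-only database would be reached through its search interface; here an in-memory stand-in returns the match counts.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical rule format: a conjunction of words that implies a category.
Rule = Tuple[FrozenSet[str], str]


def rule_to_query(rule: Rule) -> str:
    """Turn a rule's word conjunction into a boolean probing query."""
    words, _category = rule
    return " AND ".join(sorted(words))


def classify_database(
    rules: List[Rule],
    count_matches: Callable[[str], int],
    min_share: float = 0.4,  # assumed threshold, not from the paper
) -> List[str]:
    """Probe a database with one query per rule and return the categories
    whose share of all reported matches is at least min_share."""
    matches: Counter = Counter()
    for rule in rules:
        matches[rule[1]] += count_matches(rule_to_query(rule))
    total = sum(matches.values())
    if total == 0:
        return []  # no probe matched, so there is no evidence for any category
    return sorted(c for c, n in matches.items() if n / total >= min_share)


def make_toy_database(documents: List[str]) -> Callable[[str], int]:
    """In-memory stand-in for a search interface that only reports
    how many documents match a conjunctive query."""
    tokenized = [set(doc.lower().split()) for doc in documents]

    def count_matches(query: str) -> int:
        terms = [t.lower() for t in query.split(" AND ")]
        return sum(all(t in doc for t in terms) for doc in tokenized)

    return count_matches


# Hypothetical rules, standing in for the output of a rule-based classifier.
RULES: List[Rule] = [
    (frozenset({"ibm", "computer"}), "Computers"),
    (frozenset({"jordan", "bulls"}), "Sports"),
    (frozenset({"diabetes"}), "Health"),
]

DOCS = [
    "ibm ships a new computer",
    "computer makers follow ibm",
    "jordan leads the bulls",
    "diabetes study published",
]

print(classify_database(RULES, make_toy_database(DOCS)))
```

In this toy run the "Computers" probe accounts for two of the four reported matches, so it is the only category that reaches the assumed 40% share. The threshold is just one plausible way to interpret the match counts; other decision criteria would fit the same probing loop.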
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ipeirotis, P.G., Gravano, L., Sahami, M. (2001). Automatic Classification of Text Databases through Query Probing. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_16
DOI: https://doi.org/10.1007/3-540-45271-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41826-9
Online ISBN: 978-3-540-45271-3
eBook Packages: Springer Book Archive