Automatic Classification of Text Databases through Query Probing

Ipeirotis, Panagiotis G.; Gravano, Luis; Sahami, Mehran

doi:10.1007/3-540-45271-0_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1997)

Included in the following conference series:

International Workshop on the World Wide Web and Databases

Abstract

Many text databases on the web are “hidden” behind search interfaces, and their documents are only accessible through querying. Traditional search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier’s rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report initial exploratory experiments suggesting that our approach is promising for automatically characterizing the contents of text databases accessible on the web.
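
The probing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rules, the category names, the in-memory `ToyDatabase` class, and the "largest match total wins" decision are all assumptions made for illustration. A real search-only database would be queried through its web interface, and only the reported number of matches would be used.

```python
from collections import Counter

# Hypothetical rules, of the kind a rule-based classifier might learn:
# each rule is a conjunction of words that implies a category.
RULES = [
    ({"ibm", "computer"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"diabetes"}, "Health"),
    ({"cancer", "lung"}, "Health"),
]


class ToyDatabase:
    """In-memory stand-in for a search-only database (an assumption)."""

    def __init__(self, documents):
        # Store each document as a set of lowercase words.
        self._docs = [set(doc.lower().split()) for doc in documents]

    def count_matches(self, query_terms):
        """Return how many documents contain every term of the query."""
        return sum(1 for words in self._docs if query_terms <= words)


def probe(database, rules):
    """Send one conjunctive query per rule and total the matches per category."""
    matches = Counter()
    for terms, category in rules:
        matches[category] += database.count_matches(terms)
    return matches


def classify(database, rules):
    """Pick the category whose probes matched most; None if nothing matched."""
    matches = probe(database, rules)
    if not matches or max(matches.values()) == 0:
        return None
    return max(matches, key=matches.get)
```

For example, `classify(ToyDatabase(["lung cancer screening", "diabetes treatment news"]), RULES)` assigns the toy database to the hypothetical "Health" category, because only the Health probes return matches. Picking the single largest total is the simplest possible decision rule; the abstract does not specify how match counts are turned into a classification.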

Author information

Authors and Affiliations

Computer Science Department, Columbia University, 1214 Amsterdam Avenue, Mailcode: 0401, New York, NY 10027-7003, USA
Panagiotis G. Ipeirotis & Luis Gravano
E.piphany, Inc., 1900 South Norfolk Street, Suite 310, San Mateo, CA 94403, USA
Mehran Sahami

Editor information

Editors and Affiliations

Karlsruhe University, Germany
Gerhard Goos
Cornell University, NY, USA
Juris Hartmanis
Utrecht University, The Netherlands
Jan van Leeuwen
Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA
Dan Suciu
Universität Münster, Wirtschaftsinformatik, Steinfurter Str. 109, 48149 Münster, Germany
Gottfried Vossen

About this paper

Cite this paper

Ipeirotis, P.G., Gravano, L., Sahami, M. (2001). Automatic Classification of Text Databases through Query Probing. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_16

DOI: https://doi.org/10.1007/3-540-45271-0_16
Published: 22 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41826-9
Online ISBN: 978-3-540-45271-3
eBook Packages: Springer Book Archive
