Automatic Classification of Text Databases through Query Probing

  • Conference paper
The World Wide Web and Databases (WebDB 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1997)

Abstract

Many text databases on the web are “hidden” behind search interfaces, and their documents are only accessible through querying. Traditional search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier’s rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments showing that our approach is a promising way to automatically characterize the contents of text databases accessible on the web.
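The probing pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the rules in `RULES`, the category names, the in-memory `count_matches` function that stands in for a real search interface, and the decision rule in `classify`, which simply picks the category whose probes return the most matches. The abstract only states that databases are classified from the number of matches per query.

```python
from collections import Counter

# Hypothetical rules of the kind a rule-based document classifier might learn:
# each rule is a conjunction of terms that implies a category.
RULES = [
    (("ibm", "computer"), "Computers"),
    (("jordan", "bulls"), "Sports"),
    (("diabetes",), "Health"),
    (("basketball",), "Sports"),
]


def count_matches(documents, terms):
    """Stand-in for a search interface: count documents containing all terms."""
    total = 0
    for doc in documents:
        words = set(doc.lower().split())
        if all(term in words for term in terms):
            total += 1
    return total


def probe(documents, rules):
    """Issue one conjunctive query per rule and sum match counts per category."""
    matches = Counter()
    for terms, category in rules:
        matches[category] += count_matches(documents, terms)
    return matches


def classify(documents, rules):
    """Assumed decision rule: the category with the most matches, else None."""
    matches = probe(documents, rules)
    if not matches or max(matches.values()) == 0:
        return None
    return matches.most_common(1)[0][0]


# Toy database; a real one would only be reachable through its search form.
sports_db = [
    "Jordan led the Bulls to a win",
    "college basketball season opens",
    "IBM unveils a new computer",
]

print(classify(sports_db, RULES))  # Sports
```

Only match counts are used here; the documents themselves are never retrieved, which is what makes the approach applicable to search-only databases.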

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ipeirotis, P.G., Gravano, L., Sahami, M. (2001). Automatic Classification of Text Databases through Query Probing. In: Goos, G., Hartmanis, J., van Leeuwen, J., Suciu, D., Vossen, G. (eds) The World Wide Web and Databases. WebDB 2000. Lecture Notes in Computer Science, vol 1997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45271-0_16

  • DOI: https://doi.org/10.1007/3-540-45271-0_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41826-9

  • Online ISBN: 978-3-540-45271-3

  • eBook Packages: Springer Book Archive
