Augmenting Tables by Self-supervised Web Search

  • Conference paper
Enabling Real-Time Business Intelligence (BIRTE 2010)

Abstract

Users often face the problem of searching the Web for missing values in a spreadsheet. Today, only a few US-based search engines have the capacity to aggregate the wealth of information hidden in Web pages that could supply these missing values. Exploiting this information with structured queries, such as join queries, is therefore a frequently requested but still unmet requirement of many Web users.

A major challenge in this scenario is identifying keyword queries that retrieve relevant pages from a Web search engine. We address this challenge by generating keywords automatically. Our approach is based on the observation that Web page authors have already evolved common words and grammatical structures for describing important relationship types. Each keyword query should return only pages that are likely to contain a missing relation. Our keyword generator therefore continually monitors grammatical structures or lexical phrases in processed Web pages during query execution. From these, it infers significant and unambiguous keywords for retrieving pages that are likely to match the mechanics of a particular relation extractor.

We report an experimental study over multiple relation extractors. The study demonstrates that our generated keywords efficiently return complete result tuples. In contrast to other approaches, we process only very few Web pages.
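The keyword generation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each processed page is labelled by whether a relation extractor found a tuple in it, and it ranks word n-grams by a smoothed log-ratio of document frequencies. The scoring formula and the names `ngrams`, `score_phrases`, and `top_keywords` are all hypothetical choices made for illustration.

```python
from collections import Counter
import math
import re


def ngrams(text, n_max=2):
    """Yield lowercase word n-grams (n = 1..n_max) from a page text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def score_phrases(pages, n_max=2):
    """Score phrases by how strongly they indicate a page with a tuple.

    pages: iterable of (text, hit) pairs, where hit is True if a
    (hypothetical) relation extractor found a tuple on that page.
    The smoothed log-ratio below is an assumed scoring choice.
    """
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, hit in pages:
        phrases = set(ngrams(text, n_max))  # count each phrase once per page
        if hit:
            pos.update(phrases)
            n_pos += 1
        else:
            neg.update(phrases)
            n_neg += 1
    scores = {}
    for phrase in pos:
        pos_rate = (pos[phrase] + 0.5) / (n_pos + 1.0)
        neg_rate = (neg[phrase] + 0.5) / (n_neg + 1.0)
        scores[phrase] = math.log(pos_rate / neg_rate)
    return scores


def top_keywords(pages, k=3, n_max=2):
    """Return the k highest-scoring phrases, ties broken alphabetically."""
    scores = score_phrases(pages, n_max)
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]
```

Calling `top_keywords` again after each batch of processed pages would approximate the continual monitoring during query execution that the abstract describes. A real system would also need to filter stop words and handle the extractor-specific structures mentioned above.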

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Löser, A., Nagel, C., Pieper, S. (2011). Augmenting Tables by Self-supervised Web Search. In: Castellanos, M., Dayal, U., Markl, V. (eds) Enabling Real-Time Business Intelligence. BIRTE 2010. Lecture Notes in Business Information Processing, vol 84. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22970-1_7

  • DOI: https://doi.org/10.1007/978-3-642-22970-1_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22969-5

  • Online ISBN: 978-3-642-22970-1

  • eBook Packages: Computer Science, Computer Science (R0)
