Augmenting Tables by Self-supervised Web Search

Löser, Alexander; Nagel, Christoph; Pieper, Stephan

doi:10.1007/978-3-642-22970-1_7

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 84)

Included in the following conference series:

International Workshop on Business Intelligence for the Real-Time Enterprise

Abstract

Users often need to search the Web for values missing from a spreadsheet. Today only a few US-based search engines have the capacity to aggregate the wealth of information hidden in Web pages that could supply these missing values. Exploiting this information with structured queries, such as join queries, is therefore a frequently requested but still unsolved requirement of many Web users.

A major challenge in this scenario is identifying keyword queries that retrieve relevant pages from a Web search engine. We address this challenge by generating keywords automatically. Our approach is based on the observation that Web page authors have already evolved common words and grammatical structures for describing important relationship types. Each keyword query should return only pages that are likely to contain a missing relation. Our keyword generator therefore continually monitors grammatical structures and lexical phrases in processed Web pages during query execution. In this way, it infers significant and unambiguous keywords for retrieving pages that are likely to match the mechanics of a particular relation extractor.
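The keyword inference described above can be illustrated with a minimal sketch. The abstract does not specify how keywords are scored, so the smoothed log-odds score, the use of word n-grams as candidate phrases, and the names `candidate_phrases`, `rank_keywords`, and `sample_pages` are all assumptions made for illustration, not the authors' method. The idea shown is only that phrases occurring mostly in pages where a relation extractor fired are preferred as search keywords.

```python
from collections import Counter
import math
import re


def candidate_phrases(text, max_len=2):
    """Return the set of lower-cased word n-grams (length 1..max_len) in text."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases = set()
    for size in range(1, max_len + 1):
        for start in range(len(words) - size + 1):
            phrases.add(" ".join(words[start:start + size]))
    return phrases


def rank_keywords(pages, top_k=3):
    """Rank phrases by how strongly they indicate a relation-bearing page.

    pages: list of (text, has_relation) pairs, where has_relation is True
    if a (hypothetical) relation extractor found a tuple on that page.
    The score is an add-one smoothed log-odds ratio; this scoring rule is
    an assumption of the sketch, not taken from the paper.
    """
    pos_counts, neg_counts = Counter(), Counter()
    n_pos = n_neg = 0
    for text, has_relation in pages:
        phrases = candidate_phrases(text)
        if has_relation:
            n_pos += 1
            pos_counts.update(phrases)
        else:
            n_neg += 1
            neg_counts.update(phrases)

    scores = {}
    for phrase, count in pos_counts.items():
        p_pos = (count + 1) / (n_pos + 2)
        p_neg = (neg_counts[phrase] + 1) / (n_neg + 2)
        scores[phrase] = math.log(p_pos / p_neg)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


# Invented toy pages for an "acquisition" relation.
sample_pages = [
    ("Acme Corp acquired Widget Inc last year", True),
    ("Globex acquired Initech for an undisclosed sum", True),
    ("Acme Corp opened a new office last year", False),
    ("Initech released a new product", False),
]
```

On the toy pages, `rank_keywords(sample_pages, top_k=1)` selects "acquired", the only phrase that appears in both relation-bearing pages and in neither of the others. Entity names such as "acme corp" score zero because they occur equally often in both groups, which matches the abstract's goal of unambiguous keywords.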

We report an experimental study over multiple relation extractors. The study demonstrates that our generated keywords efficiently return complete result tuples. In contrast to other approaches, we process only very few Web pages.

Author information

Authors and Affiliations

DIMA Group, Technische Universität Berlin, Einsteinufer 17, 10587, Berlin, Germany
Alexander Löser, Christoph Nagel & Stephan Pieper

Editor information

Editors and Affiliations

Hewlett-Packard, 1501 Page Mill Rd, MS-1142, 94304, Palo Alto, CA, USA
Malu Castellanos & Umeshwar Dayal
Technische Universität Berlin, Einsteinufer 17, 10587, Berlin, Germany
Volker Markl

Cite this paper

Löser, A., Nagel, C., Pieper, S. (2011). Augmenting Tables by Self-supervised Web Search. In: Castellanos, M., Dayal, U., Markl, V. (eds) Enabling Real-Time Business Intelligence. BIRTE 2010. Lecture Notes in Business Information Processing, vol 84. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22970-1_7

DOI: https://doi.org/10.1007/978-3-642-22970-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22969-5
Online ISBN: 978-3-642-22970-1
eBook Packages: Computer Science, Computer Science (R0)