Abstract
Users often need to search the Web for values missing from a spreadsheet. Today only a few US-based search engines have the capacity to aggregate the wealth of information hidden in Web pages that could supply these missing values. Exploiting this information with structured queries, such as join queries, is therefore an often requested but still unmet requirement of many Web users.
A major challenge in this scenario is identifying keyword queries that retrieve relevant pages from a Web search engine. We address this challenge by generating keywords automatically. Our approach is based on the observation that Web page authors have already evolved common words and grammatical structures for describing important relationship types. Each keyword query should return only pages that are likely to contain a missing relation. Our keyword generator therefore continually monitors grammatical structures and lexical phrases in the Web pages processed during query execution. From these observations it infers significant, unambiguous keywords for retrieving pages that are likely to match the mechanics of a particular relation extractor.
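The keyword inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression extractor, the function names (toy_extractor, generate_keywords), the scoring formula, and the sample sentences are all assumptions made for illustration. The sketch performs a single batch round, whereas the approach described here monitors pages continually during query execution. It collects the lexical phrase that links two entities in every sentence the extractor accepts, and discounts phrases whose words also occur in sentences the extractor rejects, so that ambiguous phrases rank lower.

```python
import re
from collections import Counter

# Hypothetical stand-in for a relation extractor: it accepts sentences of the
# form "<Entity> <acquisition phrase> <Entity>" and returns the three parts.
PATTERN = re.compile(
    r"\b([A-Z]\w+) ((?:has |had )?(?:acquired|bought|took over)) ([A-Z]\w+)"
)


def toy_extractor(sentence):
    """Return (subject, phrase, object) if the sentence matches, else None."""
    match = PATTERN.search(sentence)
    return match.groups() if match else None


def generate_keywords(sentences, extractor, top_k=2):
    """Rank lexical phrases by how well they indicate an extractable relation."""
    positive = Counter()  # phrases seen in sentences the extractor accepted
    negative = Counter()  # words seen in sentences the extractor rejected
    for sentence in sentences:
        hit = extractor(sentence)
        if hit:
            positive[hit[1].lower()] += 1
        else:
            negative.update(re.findall(r"[a-z]+", sentence.lower()))

    def score(phrase):
        # Assumed scoring: support among accepted sentences, discounted by
        # how often the phrase's words appear in rejected sentences.
        noise = sum(negative[word] for word in phrase.split())
        return positive[phrase] / (1 + noise)

    ranked = sorted(positive, key=lambda phrase: (-score(phrase), phrase))
    return ranked[:top_k]


# Invented sample sentences standing in for text from processed Web pages.
SENTENCES = [
    "Google acquired YouTube in 2006.",
    "Oracle acquired Sun in 2010.",
    "Microsoft bought Skype.",
    "I bought milk yesterday.",
    "The weather was nice.",
]

if __name__ == "__main__":
    print(generate_keywords(SENTENCES, toy_extractor))
```

In this toy data, "bought" also appears in a sentence without any extractable relation, so it scores lower than "acquired". A real system would issue the top-ranked phrases as search queries and repeat the ranking as new pages arrive.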
We report an experimental study over multiple relation extractors. The study demonstrates that our generated keywords efficiently return complete result tuples. In contrast to other approaches, ours processes only very few Web pages.
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Löser, A., Nagel, C., Pieper, S. (2011). Augmenting Tables by Self-supervised Web Search. In: Castellanos, M., Dayal, U., Markl, V. (eds) Enabling Real-Time Business Intelligence. BIRTE 2010. Lecture Notes in Business Information Processing, vol 84. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22970-1_7
DOI: https://doi.org/10.1007/978-3-642-22970-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22969-5
Online ISBN: 978-3-642-22970-1
eBook Packages: Computer Science (R0)