Abstract
The creation of web wrappers is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is challenging because it involves a tradeoff between the expressiveness of the wrapper’s language and its safety. In addition, little attention has been paid to the execution of a wrapper in a restricted environment. In this paper we present a new wrapping language, Serrano, that has three goals: (1) the ability to run in a restricted environment, such as a browser extension, (2) extensibility, to balance the tradeoff between the expressiveness of the command set and safety, and (3) processing capabilities that eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and has provided competitive results.
A prior version of this paper has been published in the ISD2017 Proceedings (http://aisel.aisnet.org/isd2014/proceedings2017).
Notes
The browser vendor wishes to remain undisclosed.
Acknowledgements
This work was supported by project SVV 260451.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Novella, T., Holubová, I. (2018). User-Friendly and Extensible Web Data Extraction. In: Paspallis, N., Raspopoulos, M., Barry, C., Lang, M., Linger, H., Schneider, C. (eds) Advances in Information Systems Development. Lecture Notes in Information Systems and Organisation, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-74817-7_14
DOI: https://doi.org/10.1007/978-3-319-74817-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74816-0
Online ISBN: 978-3-319-74817-7
eBook Packages: Business and Management (R0)