User-Friendly and Extensible Web Data Extraction

Novella, T.; Holubová, I.

doi:10.1007/978-3-319-74817-7_14

T. Novella & I. Holubová

Conference paper
First Online: 28 March 2018

Part of the book series: Lecture Notes in Information Systems and Organisation (LNISO, volume 26)

Abstract

The creation of web wrappers is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is challenging because it involves a tradeoff between the expressiveness of the wrapper’s language and its safety. In addition, little attention has been paid to executing a wrapper in a restricted environment. In this paper we present Serrano, a new wrapping language with three goals: (1) the ability to run in a restricted environment, such as a browser extension; (2) extensibility, to balance the tradeoff between the expressiveness of the command set and safety; and (3) processing capabilities that eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and provided competitive results.

A prior version of this paper has been published in the ISD2017 Proceedings (http://aisel.aisnet.org/isd2014/proceedings2017).

Notes

1. https://github.com/salsita/Serrano/wiki/Language-Spec
2. https://github.com/salsita/Serrano/tree/master/serrano-library
3. https://jquery.com/
4. https://lodash.com/
5. http://w3c.github.io/webappsec-credential-management/
6. https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/
7. http://www.seleniumhq.org/docs/08_user_extensions.jsp#chapter08-reference
8. https://api.jquery.com/category/selectors/
9. http://api.jquery.com/jquery/
10. http://imacros.net/
11. http://wiki.imacros.net/FAQ#Q:_Does_the_macro_script_wait_for_the_page_to_fully_finish_loading.3F
12. http://wiki.imacros.net/WAIT
13. https://github.com/salsita/Serrano/wiki/Language-Spec
14. https://magneto.me/welcome/about-us.html
15. https://github.com/salsita/Serrano/tree/master/magneto/scraping-units
16. http://mypoints.com/
17. https://github.com/salsita/Serrano/wiki/Language-Spec#dom-manipulation
18. The browser vendor wishes to remain undisclosed.

Acknowledgements

This work was supported by project SVV 260451.

Author information

Authors and Affiliations

Charles University, Prague, Czechia
T. Novella & I. Holubová

Authors

T. Novella
I. Holubová

Corresponding author

Correspondence to I. Holubová.

Editor information

Editors and Affiliations

School of Sciences, University of Central Lancashire, Larnaca, Cyprus
Nearchos Paspallis
School of Sciences, University of Central Lancashire, Larnaca, Cyprus
Marios Raspopoulos
Cairnes School of Business and Economics, National University of Ireland Galway, Galway, Ireland
Chris Barry
Cairnes School of Business and Economics, National University of Ireland Galway, Galway, Ireland
Michael Lang
Faculty of Information Technology, Monash University, Melbourne, Australia
Henry Linger
Department of Information Systems, City University of Hong Kong, Kowloon, Hong Kong
Christoph Schneider

About this paper

Cite this paper

Novella, T., Holubová, I. (2018). User-Friendly and Extensible Web Data Extraction. In: Paspallis, N., Raspopoulos, M., Barry, C., Lang, M., Linger, H., Schneider, C. (eds) Advances in Information Systems Development. Lecture Notes in Information Systems and Organisation, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-74817-7_14

DOI: https://doi.org/10.1007/978-3-319-74817-7_14
Published: 28 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74816-0
Online ISBN: 978-3-319-74817-7
eBook Packages: Business and Management (R0)
