User-Friendly and Extensible Web Data Extraction

  • Conference paper

Part of the book series: Lecture Notes in Information Systems and Organisation (LNISO, volume 26)

Abstract

Creation of web wrappers is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is a challenging task, because it introduces tradeoffs between expressiveness of a wrapper’s language and safety. In addition, little attention has been paid to execution of a wrapper in a restricted environment. In this paper we present a new wrapping language—Serrano—that has three goals: (1) ability to run in a restricted environment, such as a browser extension, (2) extensibility to balance the tradeoffs between expressiveness of a command set and safety, and (3) processing capabilities to eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and provided competitive results.

A prior version of this paper has been published in the ISD2017 Proceedings (http://aisel.aisnet.org/isd2014/proceedings2017).

Notes

  1. https://github.com/salsita/Serrano/wiki/Language-Spec

  2. https://github.com/salsita/Serrano/tree/master/serrano-library

  3. https://jquery.com/

  4. https://lodash.com/

  5. http://w3c.github.io/webappsec-credential-management/

  6. https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/

  7. http://www.seleniumhq.org/docs/08_user_extensions.jsp#chapter08-reference

  8. https://api.jquery.com/category/selectors/

  9. http://api.jquery.com/jquery/

  10. http://imacros.net/

  11. http://wiki.imacros.net/FAQ#Q:_Does_the_macro_script_wait_for_the_page_to_fully_finish_loading.3F

  12. http://wiki.imacros.net/WAIT

  13. https://github.com/salsita/Serrano/wiki/Language-Spec

  14. https://magneto.me/welcome/about-us.html

  15. https://github.com/salsita/Serrano/tree/master/magneto/scraping-units

  16. http://mypoints.com/

  17. https://github.com/salsita/Serrano/wiki/Language-Spec#dom-manipulation

  18. The browser vendor wishes to remain undisclosed.

Acknowledgements

This work was supported by project SVV 260451.

Author information

Corresponding author

Correspondence to I. Holubová.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Novella, T., Holubová, I. (2018). User-Friendly and Extensible Web Data Extraction. In: Paspallis, N., Raspopoulos, M., Barry, C., Lang, M., Linger, H., Schneider, C. (eds) Advances in Information Systems Development. Lecture Notes in Information Systems and Organisation, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-74817-7_14
