Self-supervised Automated Wrapper Generation for Weblog Data Extraction

Gkotsis, George; Stepanyan, Karen; Cristea, Alexandra I.; Joy, Mike

doi:10.1007/978-3-642-39467-6_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7968)

Included in the following conference series:

British National Conference on Databases

Abstract

Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
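
The abstract does not spell out the rule-scoring details, so the following is only a rough sketch of the general idea: match a feed value (here, a post title) against the HTML elements of several posts, and keep the element pattern that matches most consistently. It assumes exact text matching and a plain vote count in place of the paper's probabilistic model, and it describes a rule as a (tag, class) pair. The names `ElementTextCollector`, `candidate_rules`, and `derive_rule`, as well as the sample posts, are hypothetical and not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextCollector(HTMLParser):
    """Collect (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class) pairs
        self.found = []   # (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack:
            tag, cls = self.stack[-1]
            self.found.append((tag, cls, text))


def candidate_rules(html, feed_value):
    """Return the (tag, class) pairs whose text equals the feed value."""
    parser = ElementTextCollector()
    parser.feed(html)
    target = " ".join(feed_value.split()).lower()
    return {(tag, cls) for tag, cls, text in parser.found
            if text.lower() == target}


def derive_rule(posts):
    """posts: iterable of (html, feed_value). Returns (rule, support) or None.

    Assumption: support is the fraction of posts in which the rule located
    the feed value; the paper's actual scoring may differ.
    """
    votes = Counter()
    total = 0
    for html, feed_value in posts:
        total += 1
        votes.update(candidate_rules(html, feed_value))
    if not votes:
        return None
    rule, hits = votes.most_common(1)[0]
    return rule, hits / total


# Hypothetical posts: (HTML of the post page, title taken from the web feed).
posts = [
    ('<html><head><title>First post</title></head><body>'
     '<h1 class="site">My Blog</h1>'
     '<h2 class="entry-title">First post</h2><p>Hello</p></body></html>',
     "First post"),
    ('<html><head><title>My Blog</title></head><body>'
     '<h1 class="site">My Blog</h1>'
     '<h2 class="entry-title">Second post</h2></body></html>',
     "Second post"),
    ('<html><head><title>My Blog</title></head><body>'
     '<h1 class="site">My Blog</h1>'
     '<h2 class="entry-title">Third post</h2></body></html>',
     "Third post"),
]

result = derive_rule(posts)
print(result)
```

In the first sample post the title also appears in the `<title>` element, so a single post yields two candidate rules. Aggregating over several posts is what separates the consistent pattern from the coincidental one, which is why the sketch votes across posts rather than comparing posts pairwise.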

Author information

Authors and Affiliations

Department of Computer Science, University of Warwick, Coventry, CV4 7AL, United Kingdom
George Gkotsis, Karen Stepanyan, Alexandra I. Cristea & Mike Joy

Editor information

Editors and Affiliations

Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, OX1 3QD, Oxford, UK
Georg Gottlob
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, OX1 3QD, Oxford, UK
Giovanni Grasso & Christian Schallhart
University of Oxford, Wolfson Building, Parks Road, OX1 3QD, Oxford, UK
Dan Olteanu

Cite this paper

Gkotsis, G., Stepanyan, K., Cristea, A.I., Joy, M. (2013). Self-supervised Automated Wrapper Generation for Weblog Data Extraction. In: Gottlob, G., Grasso, G., Olteanu, D., Schallhart, C. (eds) Big Data. BNCOD 2013. Lecture Notes in Computer Science, vol 7968. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39467-6_26

DOI: https://doi.org/10.1007/978-3-642-39467-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39466-9
Online ISBN: 978-3-642-39467-6
eBook Packages: Computer Science, Computer Science (R0)