Heuristic Approach to Automatic Wrapper Generation for Social Media Websites

Baziński, Bartosz; Brzezicki, Michał

doi:10.1007/978-3-642-32518-2_26


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 185)


Abstract

The data contained in user-generated content websites are valuable in many applications, for example in social media monitoring or in the acquisition of training sets for machine learning algorithms. Mining such data is especially difficult in the case of web forums, because hundreds of different forum engines are in use. We propose an algorithm capable of unsupervised extraction of posts from social websites without the need to analyse more than one page in advance. Our method locates potential data regions by analysing repetition within the document structure and then filtering the candidate results. Subsequently, the fields of the data records are found using key characteristics and series-wide dependencies. In experiments on single pages taken from 231 websites, we achieved 87% precision and 82% recall of extraction. Our solution is computationally efficient, which enables a wide range of applications.
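The repetition analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given on this page. It only shows one simple way to flag candidate data regions on a single page: a parent element is reported when enough of its direct children share the same tag and class. The names `candidate_regions`, `TreeBuilder`, `Node`, and the `min_repeats` threshold are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.cls = dict(attrs).get("class") or ""
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumption: tag name plus class attribute is enough to compare siblings.
        return (self.tag, self.cls)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def candidate_regions(html, min_repeats=3):
    """Return (parent_tag, parent_class, child_signature, count) for each
    element whose most frequent child signature repeats at least min_repeats times."""
    builder = TreeBuilder()
    builder.feed(html)
    regions = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(child.signature() for child in node.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats:
                regions.append((node.tag, node.cls, sig, n))
    return regions
```

A real system would still need the filtering and field-detection steps the abstract describes; this sketch stops at proposing candidate regions.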



Author information

Authors and Affiliations

Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233, Gdańsk, Poland
Bartosz Baziński & Michał Brzezicki

Corresponding author

Correspondence to Bartosz Baziński.

Editor information

Editors and Affiliations

Department of Computer Science, Eindhoven University of Technology, Eindhoven, 5600, Netherlands
Mykola Pechenizkiy
Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2, Poznan, 60-965, Poland
Marek Wojciechowski

Cite this paper

Baziński, B., Brzezicki, M. (2013). Heuristic Approach to Automatic Wrapper Generation for Social Media Websites. In: Pechenizkiy, M., Wojciechowski, M. (eds) New Trends in Databases and Information Systems. Advances in Intelligent Systems and Computing, vol 185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32518-2_26

DOI: https://doi.org/10.1007/978-3-642-32518-2_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32517-5
Online ISBN: 978-3-642-32518-2
eBook Packages: Engineering, Engineering (R0)
