Abstract
The data contained in user-generated content websites are valuable in many applications, for example in social media monitoring or in the acquisition of training sets for machine learning algorithms. Mining such data is especially difficult in the case of web forums, because hundreds of different forum engines are in use. We propose an algorithm capable of unsupervised extraction of posts from social websites without the need to analyse more than one page in advance. Our method locates candidate data regions by analysing repetition within the document structure and then filtering the candidates. Subsequently, the fields of the data records are found using key characteristics and series-wide dependencies. In experiments on single pages taken from 231 websites, the method achieved 87% extraction precision and 82% recall. The solution is computationally efficient, which makes it suitable for a wide range of applications.
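The idea of locating data regions through repetition in the document structure can be illustrated with a small sketch. This is not the authors' algorithm, whose details the abstract does not give; it is a minimal illustration that assumes a "data region" is any element with at least `min_repeats` children sharing the same shallow structural signature. The names `Node`, `TreeBuilder`, `candidate_regions`, `min_repeats` and the `SAMPLE` page are all hypothetical, and the filtering and field-detection steps are not covered.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, non-exhaustive set).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Minimal DOM node: tag name, class attribute, parent and children."""

    def __init__(self, tag, cls, parent=None):
        self.tag, self.cls, self.parent = tag, cls, parent
        self.children = []

    def signature(self):
        # Shallow structural signature (an assumption of this sketch):
        # own tag and class plus the sequence of child tags.
        return (self.tag, self.cls, tuple(c.tag for c in self.children))


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", "")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("class") or "", self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def candidate_regions(html, min_repeats=3):
    """Return (repeat_count, parent_node, signature) tuples, most repeated first."""
    builder = TreeBuilder()
    builder.feed(html)
    results = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        # Count signatures of non-leaf children only; leaves carry no structure.
        counts = Counter(c.signature() for c in node.children if c.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats:
                results.append((n, node, sig))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Hypothetical single forum page: a short navigation list and three posts.
SAMPLE = """
<div class="page">
  <ul class="nav"><li><a>Home</a></li><li><a>FAQ</a></li></ul>
  <div class="posts">
    <div class="post"><span class="author">ann</span><p>First</p></div>
    <div class="post"><span class="author">bob</span><p>Second</p></div>
    <div class="post"><span class="author">cy</span><p>Third</p></div>
  </div>
</div>
"""

if __name__ == "__main__":
    for count, parent, sig in candidate_regions(SAMPLE):
        print(count, parent.tag, parent.cls, sig)
```

On this sample only the `posts` container qualifies, because the navigation list repeats its item structure just twice. A real page would yield many more candidates (menus, sidebars, pagination), which is presumably why the method described above follows repetition analysis with a filtering step.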
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baziński, B., Brzezicki, M. (2013). Heuristic Approach to Automatic Wrapper Generation for Social Media Websites. In: Pechenizkiy, M., Wojciechowski, M. (eds) New Trends in Databases and Information Systems. Advances in Intelligent Systems and Computing, vol 185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32518-2_26
DOI: https://doi.org/10.1007/978-3-642-32518-2_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32517-5
Online ISBN: 978-3-642-32518-2
eBook Packages: Engineering, Engineering (R0)