Heuristic Approach to Automatic Wrapper Generation for Social Media Websites

  • Conference paper
New Trends in Databases and Information Systems

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 185)

Abstract

Data contained in user-generated content websites are valuable in many applications, for example in social media monitoring or in the acquisition of training sets for machine learning algorithms. Mining such data is especially difficult in the case of web forums, because hundreds of different forum engines are in use. We propose an algorithm capable of unsupervised extraction of posts from social websites without the need to analyse more than one page in advance. Our method localizes potential data regions by repetition analysis within the document structure and by filtering the candidate results. Subsequently, the fields of data records are found using key characteristics and series-wide dependencies. In experiments on single pages taken from 231 websites, we achieved 87% precision and 82% recall of extraction. Our solution is computationally efficient, which enables a wide range of applications.
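The idea of localizing data regions by repetition analysis can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail; it is a minimal illustration under our own assumptions: a data region is taken to be the parent element whose children most often share the same shallow tag structure. The names `Node`, `TreeBuilder`, `signature`, `find_data_region` and the `min_repeats` threshold are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node: tag name, attributes, parent and children."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from a single page (text is ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node, depth=2):
    """Structural fingerprint: the tag plus the tags of descendants up to `depth`."""
    if depth == 0 or not node.children:
        return node.tag
    inner = ",".join(signature(child, depth - 1) for child in node.children)
    return f"{node.tag}({inner})"


def find_data_region(root, min_repeats=3):
    """Return (count, parent, signature) for the most repeated child structure.

    Children without sub-elements are skipped, a crude filter (our assumption)
    that discards flat link lists such as navigation bars.
    """
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(signature(c) for c in node.children if c.children)
        if not counts:
            continue
        sig, count = counts.most_common(1)[0]
        if count >= min_repeats and (best is None or count > best[0]):
            best = (count, node, sig)
    return best


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/faq">FAQ</a></div>
<div id="posts">
  <div class="post"><span>alice</span><p>Hello</p></div>
  <div class="post"><span>bob</span><p>Hi there</p></div>
  <div class="post"><span>carol</span><p>Welcome</p></div>
</div>
</body></html>
"""

count, region, sig = find_data_region(parse(SAMPLE))
print(count, region.attrs.get("id"), sig)
```

On the sample page the sketch selects the `posts` container, because its three children share the structure `div(span,p)`. A practical extractor would need more tolerant matching (for example, an edit distance over child sequences) and the field-level analysis mentioned in the abstract, neither of which is attempted here.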

Author information

Corresponding author

Correspondence to Bartosz Baziński.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baziński, B., Brzezicki, M. (2013). Heuristic Approach to Automatic Wrapper Generation for Social Media Websites. In: Pechenizkiy, M., Wojciechowski, M. (eds) New Trends in Databases and Information Systems. Advances in Intelligent Systems and Computing, vol 185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32518-2_26

  • DOI: https://doi.org/10.1007/978-3-642-32518-2_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32517-5

  • Online ISBN: 978-3-642-32518-2

  • eBook Packages: Engineering, Engineering (R0)
