Automatic Extraction of Logical Web Lists

  • Pasqua Fabiana Lanotte
  • Fabio Fumarola
  • Michelangelo Ceci
  • Andrea Scarpino
  • Michele Damiano Torelli
  • Donato Malerba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8502)

Abstract

Recently, there has been increased interest in the extraction of structured data from the Web (both the “Surface” Web and the “Hidden” Web). In this paper we focus on the automatic extraction of Web lists. Although this task has been studied extensively, existing approaches assume that a list is wholly contained in a single Web page. They do not consider that many websites spread their listings across several Web pages and show only a partial view on each of them. Similar to databases, where a view can represent a subset of the data contained in a table, these websites split a logical list into multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle this issue we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life and data-intensive websites confirm the effectiveness of our approach.
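
The relationship between view lists and a logical list can be illustrated with a small sketch. It is not the algorithm proposed in the paper, which the abstract does not describe; it only assumes that the view lists have already been extracted from each page and that the page order is known. The function name `merge_view_lists`, the example URLs, and the item names are all hypothetical.

```python
from typing import Dict, List


def merge_view_lists(view_lists: Dict[str, List[str]], page_order: List[str]) -> List[str]:
    """Concatenate per-page view lists into one logical list, keeping each item once."""
    logical_list: List[str] = []
    seen = set()
    for page in page_order:
        for item in view_lists.get(page, []):
            if item not in seen:  # an item repeated on a later page is kept only once
                seen.add(item)
                logical_list.append(item)
    return logical_list


# Hypothetical paginated listing: each page shows only a partial view.
views = {
    "/products?page=1": ["laptop A", "laptop B"],
    "/products?page=2": ["laptop C", "laptop D"],
    "/products?page=3": ["laptop D", "laptop E"],
}
order = ["/products?page=1", "/products?page=2", "/products?page=3"]
logical = merge_view_lists(views, order)
print(logical)
```

The hard part of the problem, finding which pages and which page regions hold views of the same logical list, is assumed away here.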

Keywords

Web List Mining, Structured Data Extraction, Logical List

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Pasqua Fabiana Lanotte (1)
  • Fabio Fumarola (1)
  • Michelangelo Ceci (1)
  • Andrea Scarpino (1)
  • Michele Damiano Torelli (1)
  • Donato Malerba (1)

  1. Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”, Bari, Italy
