Related Work

  • Eli Cortez
  • Altigran S. da Silva
Chapter
Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)

Abstract

Different approaches have been proposed in the literature to address the problem of extracting valuable data from the Web. This chapter presents an overview of such approaches. It begins by presenting a broad set of Web extraction methods and tools. Following a taxonomy previously used in the literature (Laender et al. 2002), they are divided into distinct groups according to their main approach. These groups are: Languages for Wrapper Development, Wrapper Induction Methods, NLP-based Methods, Ontology-based Methods, and HTML-aware Methods. Next, the chapter presents probabilistic graph-based methods, both supervised and unsupervised, and discusses their main characteristics in comparison with the unsupervised approach presented in this book.

Keywords

Information extraction, Wrappers, NLP, HTML, Probabilistic methods, CRF

References

  1. Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 20–29). USA: Seattle.
  2. Arocena, G., & Mendelzon, A. (1998). Weboql: Restructuring documents, databases and webs. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 24–33). USA: Orlando.
  3. Banko, M., Cafarella, M., Soderland, S., Broadhead, M., & Etzioni, O. (2009). Open information extraction for the web. PhD thesis, University of Washington, Washington.
  4. Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. Proceedings of the IJCAI International Joint Conference on Artificial Intelligence (pp. 2670–2676). India: Hyderabad.
  5. Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175–186). USA: Santa Barbara.
  6. Cafarella, M., Halevy, A., Wang, D., Wu, E., & Zhang, Y. (2008). Webtables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1), 538–549.
  7. Chiang, F., Andritsos, P., Zhu, E., & Miller, R. (2012). Autodict: Automated dictionary discovery. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 1277–1280). USA: Washington.
  8. Chuang, S., Chang, K., & Zhai, C. (2007). Context-aware wrapping: synchronized data extraction. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 699–710). Austria: Vienna.
  9. Cortez, E., da Silva, A., Gonçalves, M., Mesquita, F., & de Moura, E. (2007). FLUX-CIM: flexible unsupervised extraction of citation metadata. Proceedings of the ACM/IEEE JCDL Joint Conference on Digital Libraries (pp. 215–224). Canada: Vancouver.
  10. Cortez, E., da Silva, A. S., Gonçalves, M. A., Mesquita, F., & de Moura, E. S. (2009). A flexible approach for extracting metadata from bibliographic citations. Journal of the American Society for Information Science and Technology, 60(6), 1144–1158.
  11. Crescenzi, V., & Mecca, G. (1998). Grammars have exceptions. Information Systems, 23(8), 539–565.
  12. Crescenzi, V., Mecca, G., & Merialdo, P. (2001). Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the VLDB International Conference on Very Large Data Bases (pp. 109–118). Italy: Rome.
  13. Dalvi, N., Bohannon, P., & Sha, F. (2009). Robust web extraction: an approach based on a probabilistic tree-edit model. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 335–348). USA: Providence.
  14. Elmeleegy, H., Madhavan, J., & Halevy, A. (2009). Harvesting relational tables from lists on the web. Proceedings of the VLDB Endowment, 2(1), 1078–1089.
  15. Embley, D., Campbell, D., Jiang, Y., Liddle, S., Lonsdale, D., Ng, Y., et al. (1999a). Conceptual-model-based data extraction from multiple-record web pages. Data and Knowledge Engineering, 31(3), 227–251.
  16. Embley, D., Jiang, Y., & Ng, Y. (1999b). Record-boundary discovery in web documents. ACM SIGMOD Record, 28(2), 467–478.
  17. Etzioni, O., Banko, M., Soderland, S., & Weld, D. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.
  18. Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. Proceedings of the National Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence (pp. 584–589). USA: Austin.
  19. Hammer, J., McHugh, J., & Garcia-Molina, H. (1997). Semistructured data: The tsimmis experience. Proceedings of the East-European Symposium on Advances in Databases and Information Systems (pp. 1–8). Russia: St. Petersburg.
  20. Hsu, C., & Dung, M. (1998). Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8), 521–538.
  21. Kristjansson, T., Culotta, A., Viola, P., & McCallum, A. (2004). Interactive information extraction with constrained conditional random fields. Proceedings of the AAAI National Conference on Artificial Intelligence (pp. 412–418). USA: San Jose.
  22. Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2), 15–68.
  23. Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (2002). A brief survey of web data extraction tools. SIGMOD Record, 31(2), 84–93.
  24. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the ICML International Conference on Machine Learning (pp. 282–289). USA: Williamstown.
  25. Mansuri, I. R., & Sarawagi, S. (2006). Integrating unstructured data into relational databases. Proceedings of the IEEE ICDE International Conference on Data Engineering (pp. 29–41). USA: Atlanta.
  26. Michelson, M., & Knoblock, C. (2007). Unsupervised information extraction from unstructured, ungrammatical data sources on the world wide web. International Journal on Document Analysis and Recognition, 10(3), 211–226.
  27. Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the National Conference on Artificial Intelligence (pp. 328–334). USA: Orlando.
  28. Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1–2), 93–114.
  29. Peng, F., & McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing and Management, 42(4), 963–979.
  30. Reis, D. C., Golgher, P. B., Silva, A. S., & Laender, A. F. (2004). Automatic web news extraction using tree edit distance. Proceedings of the WWW International World Wide Web Conferences (pp. 502–511). USA: New York.
  31. Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
  32. Serra, E., Cortez, E., da Silva, A., & de Moura, E. (2011). On using Wikipedia to build knowledge bases for information extraction by text segmentation. Journal of Information and Data Management, 2(3), 259.
  33. Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1), 233–272.
  34. Zhao, C., Mahmud, J., & Ramakrishnan, I. (2008). Exploiting structured reference data for unsupervised text segmentation with conditional random fields. Proceedings of the SIAM International Conference on Data Mining (pp. 420–431). USA: Atlanta.
  35. Zhao, H., Meng, W., Wu, Z., Raghavan, V., & Yu, C. (2005). Fully automatic wrapper generation for search engines. Proceedings of the WWW International World Wide Web Conferences (pp. 66–75). Japan: Chiba.

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Instituto de Computação, Universidade Federal do Amazonas, Manaus, Brazil
