Introduction

  • Eli Cortez
  • Altigran S. da Silva
Chapter
Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)

Abstract

The Information Extraction (IE) problem refers to the automatic extraction of structured information from noisy, unstructured textual sources. It is a research topic in several Computer Science communities, such as Databases, Information Retrieval, and Artificial Intelligence. This chapter introduces the problem and gives an overview of how information extraction fits into the broader field of data management. It also lists the main contributions of this book.

Keywords

World Wide Web, Information extraction, Textual sources, Text segmentation, Databases, Data management

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Instituto de Computação, Universidade Federal do Amazonas, Manaus, Brazil
