
Exploiting Pre-Existing Datasets to Support IETS

  • Eli Cortez
  • Altigran S. da Silva
Chapter
Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)

Abstract

This chapter describes in detail a new approach that exploits pre-existing datasets to support Information Extraction by Text Segmentation (IETS) methods. It first gives a brief overview of the approach and introduces the concept of a knowledge base. It then discusses each step of the unsupervised approach: how to learn content-based features from knowledge bases; how to automatically induce structure-based features without any prior human-driven training, a capability unique to this approach; and how to effectively combine these features to label the segments of an input text.

Keywords

Information extraction; Unsupervised approach; Text segmentation; Databases; Structured data; Knowledge bases; Markov models

Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Instituto de Computação, Universidade Federal do Amazonas, Manaus, Brazil
