Cost-Effective Web Search in Bootstrapping for Named Entity Recognition

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4947)

Abstract

In this paper, we propose a cost-effective search strategy framework for extracting keywords of the same semantic class from the Web. Constructing a dictionary with the bootstrapping technique is a promising approach to harnessing knowledge scattered around the Web, and open web application programming interfaces (APIs) are powerful tools for this knowledge-gathering process. However, the cost of API calls must be considered: too many queries can overload the search engines, and the search engines also limit the number of API calls. Our goal is to optimize a search strategy so that it collects as many new words as possible with the fewest API calls. Our results show that the optimized search strategy extracts 64,642 words in five different domains with a precision of 0.94, using only 1,000 search API calls.
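
As a rough illustration of the setting described in the abstract, the following minimal sketch runs a bootstrapping loop that stops once a fixed budget of search calls is spent. It is not the strategy proposed in the paper: the first-in, first-out query order, the function names `bootstrap` and `toy_search`, and the in-memory `TOY_INDEX` that stands in for a web search API are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical stand-in for a web search API: maps a query word to the
# candidate words "extracted" from its results. Not real search data.
TOY_INDEX = {
    "tokyo": ["osaka", "kyoto"],
    "osaka": ["kyoto", "nagoya"],
    "kyoto": ["nara"],
    "nagoya": ["sapporo"],
}


def toy_search(query: str) -> List[str]:
    """Return candidate words for a query (one simulated API call)."""
    return TOY_INDEX.get(query, [])


def bootstrap(
    seeds: Iterable[str],
    search: Callable[[str], List[str]],
    budget: int,
) -> Tuple[Set[str], int]:
    """Grow a dictionary from seed words using at most `budget` search calls.

    Words are used as queries in the order they were discovered (FIFO),
    which is an assumed baseline policy, not an optimized one.
    """
    known: Set[str] = set(seeds)
    pending = deque(known)  # words not yet issued as queries
    calls = 0
    while pending and calls < budget:
        query = pending.popleft()
        calls += 1  # every search counts against the budget
        for word in search(query):
            if word not in known:
                known.add(word)
                pending.append(word)
    return known, calls


words, used = bootstrap(["tokyo"], toy_search, budget=3)
new_words_per_call = (len(words) - 1) / used  # yield of the budget spent
```

With a budget of three calls, the loop queries "tokyo", "osaka", and "kyoto" and then stops, even though further unqueried words remain. An optimized strategy in the sense of the abstract would choose which word to query next so that the number of new words per call is as high as possible.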

Editor information

Jayant R. Haritsa, Ramamohanarao Kotagiri, Vikram Pudi

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kawai, H., Mizuguchi, H., Tsuchida, M. (2008). Cost-Effective Web Search in Bootstrapping for Named Entity Recognition. In: Haritsa, J.R., Kotagiri, R., Pudi, V. (eds) Database Systems for Advanced Applications. DASFAA 2008. Lecture Notes in Computer Science, vol 4947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78568-2_29

  • DOI: https://doi.org/10.1007/978-3-540-78568-2_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78567-5

  • Online ISBN: 978-3-540-78568-2

  • eBook Packages: Computer Science, Computer Science (R0)
