Cost-Effective Web Search in Bootstrapping for Named Entity Recognition

Kawai, Hideki; Mizuguchi, Hironori; Tsuchida, Masaaki

doi:10.1007/978-3-540-78568-2_29


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4947)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications


Abstract

In this paper, we propose a cost-effective search strategy framework for extracting keywords that belong to the same semantic class from the Web. Constructing a dictionary with the bootstrapping technique is a promising approach to harnessing knowledge scattered across the Web, and open web application programming interfaces (APIs) are powerful tools for this knowledge-gathering process. However, the cost of API calls must be considered: too many queries can overload the search engines, and the search engines also limit the number of API calls. Our goal is to optimize the search strategy so that it collects as many new words as possible with the fewest API calls. Our results show that the optimized search strategy can extract 64,642 words in five different domains, with a precision of 0.94, using only 1,000 search API calls.
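To make the idea of bootstrapping under a call budget concrete, the following is a minimal sketch and not the authors' optimized strategy, which the abstract does not describe. It assumes a plain first-in, first-out query order. The names `bootstrap`, `toy_search`, `toy_extract`, and `TOY_PAGES` are hypothetical, and the in-memory page list stands in for a real search API.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple


def bootstrap(
    seeds: Iterable[str],
    search: Callable[[str], List[str]],
    extract: Callable[[str, Set[str]], List[str]],
    budget: int,
) -> Tuple[Set[str], int]:
    """Grow a dictionary from seed words using at most `budget` search calls."""
    seed_list = list(seeds)
    dictionary: Set[str] = set(seed_list)
    pending = deque(seed_list)  # words not yet used as queries, oldest first
    calls = 0
    while pending and calls < budget:
        query = pending.popleft()
        calls += 1  # each search counts against the budget
        for page in search(query):
            for word in extract(page, dictionary):
                if word not in dictionary:
                    dictionary.add(word)
                    pending.append(word)  # new words become future queries
    return dictionary, calls


# Hypothetical stand-in for the Web: each "page" is one comma-separated list.
TOY_PAGES = [
    "tokyo, osaka, kyoto",
    "kyoto, nara, kobe",
    "nara, ikoma",
    "apple, banana",
]


def toy_search(query: str) -> List[str]:
    """Return the toy pages whose list contains the query word."""
    return [page for page in TOY_PAGES if query in page.split(", ")]


def toy_extract(page: str, known: Set[str]) -> List[str]:
    """Return all items of a list that shares at least one known word."""
    items = page.split(", ")
    return items if any(item in known for item in items) else []


if __name__ == "__main__":
    words, calls = bootstrap(["tokyo"], toy_search, toy_extract, budget=3)
    print(sorted(words), calls)
```

In this sketch the quantity to maximize is the number of new words gained per call. By the same measure, the reported result of 64,642 words from 1,000 calls corresponds to roughly 65 words per call.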



Author information

Authors and Affiliations

NEC C&C Innovation Research Laboratories, 8916-47, Takayama-cho, Ikoma, Nara, Japan
Hideki Kawai
NEC Service Platforms Research Laboratories, 8916-47, Takayama-cho, Ikoma, Nara, Japan
Hironori Mizuguchi & Masaaki Tsuchida


Editor information

Jayant R. Haritsa, Ramamohanarao Kotagiri, Vikram Pudi


Cite this paper

Kawai, H., Mizuguchi, H., Tsuchida, M. (2008). Cost-Effective Web Search in Bootstrapping for Named Entity Recognition. In: Haritsa, J.R., Kotagiri, R., Pudi, V. (eds) Database Systems for Advanced Applications. DASFAA 2008. Lecture Notes in Computer Science, vol 4947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78568-2_29


DOI: https://doi.org/10.1007/978-3-540-78568-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78567-5
Online ISBN: 978-3-540-78568-2
eBook Packages: Computer Science, Computer Science (R0)
