Abstract
Unstructured data, i.e., data that were not created for computer usage, make up about 80% of all digital documents. Most unstructured data are textual documents written in natural language; this kind of data is clearly a powerful information source that needs to be handled well. Deep language understanding methods can greatly improve access to unstructured data compared with traditional information retrieval methods. In this chapter, we provide a brief overview of the relationship between natural language processing and search applications. We describe some machine learning methods that are used to formalize natural language problems in probabilistic terms. We then discuss the main challenges of automatic text processing, focusing on question answering as a representative application of various deep text processing techniques.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., Quarteroni, S. (2013). Natural Language Processing for Search. In: Web Information Retrieval. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39314-3_5
DOI: https://doi.org/10.1007/978-3-642-39314-3_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39313-6
Online ISBN: 978-3-642-39314-3
eBook Packages: Computer Science, Computer Science (R0)