
Natural Language Processing for Search

Web Information Retrieval

Abstract

Unstructured data, i.e., data that have not been created for computer usage, make up about 80% of all digital documents. Most unstructured data are textual documents written in natural language, which makes them a powerful information source that needs to be handled well. Deep language understanding methods may greatly improve access to unstructured data compared with traditional information retrieval methods. In this chapter, we provide a brief overview of the relationship between natural language processing and search applications. We describe some machine learning methods that are used for formalizing natural language problems in probabilistic terms. We then discuss the main challenges behind automatic text processing, focusing on question answering as a representative example of the application of various deep text processing techniques.


Notes

  1. Available at: alias-i.com/lingpipe.

  2. www.cis.upenn.edu/~ace.



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., Quarteroni, S. (2013). Natural Language Processing for Search. In: Web Information Retrieval. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39314-3_5


  • DOI: https://doi.org/10.1007/978-3-642-39314-3_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39313-6

  • Online ISBN: 978-3-642-39314-3

  • eBook Packages: Computer Science, Computer Science (R0)
