Reverse-Engineering Question/Answer Collections From Ordinary Text

  • Chapter
Advances in Open Domain Question Answering

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 32))

Researchers have begun to investigate the use of statistical and machine learning methods for question answering. These techniques require training data, usually in the form of question/answer sets. In this chapter, we describe a reverse-engineering procedure that can be used to generate question/answer sets automatically from ordinary text corpora. Our technique identifies sentences that are good candidates for question/answer extraction, extracts the portions of the sentence corresponding to the question and the answer, and then transforms the information into an actual question and answer. Using this procedure, a collection of questions and answers can be automatically generated from any text corpus. One key benefit of this automatic procedure is that question/answer sets can be easily generated from domain-specific corpora, creating training data which could be used to build a Q/A system tailored for a specific domain.
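
The three stages named above (select candidate sentences, extract the question and answer portions, transform them into a question/answer pair) can be illustrated with a minimal Python sketch. This is not the chapter's actual method: the pattern table, the `QAPair` type, and the `generate_qa` function are hypothetical names introduced only for illustration, and simple regular expressions stand in for whatever sentence analysis the authors use.

```python
import re
from typing import Iterator, NamedTuple


class QAPair(NamedTuple):
    question: str
    answer: str
    source: str  # the sentence the pair was derived from


# Hypothetical pattern table (an assumption, not taken from the chapter).
# Each regex flags a candidate sentence and captures the subject and the
# answer span; the template turns the subject into a question.
PATTERNS = [
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) was born in (?P<ans>[A-Z][\w ]*?)\.$"),
     "Where was {subj} born?"),
    (re.compile(r"^(?P<subj>[A-Z][\w ]*?) was created by (?P<ans>[A-Z][\w ]*?)\.$"),
     "Who created {subj}?"),
]


def generate_qa(text: str) -> Iterator[QAPair]:
    """Yield question/answer pairs for sentences that match a known pattern."""
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for pattern, template in PATTERNS:
            match = pattern.match(sentence)
            if match:
                # Stage 2: extract spans; stage 3: phrase the question.
                question = template.format(subj=match["subj"])
                yield QAPair(question, match["ans"], sentence)
                break  # one pair per sentence is enough for this sketch


if __name__ == "__main__":
    corpus = ("Marie Curie was born in Warsaw. The weather was mild. "
              "Linux was created by Linus Torvalds.")
    for pair in generate_qa(corpus):
        print(pair.question, "->", pair.answer)
```

Sentences that match no pattern are skipped, which mirrors the idea that only some sentences are good candidates for question/answer extraction. Swapping in a domain-specific corpus changes the output without changing the procedure.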

Copyright information

© 2008 Springer

About this chapter

Cite this chapter

Riloff, E., Mann, G.S., Phillips, W. (2008). Reverse-Engineering Question/Answer Collections From Ordinary Text. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_17
