Reverse-Engineering Question/Answer Collections From Ordinary Text

Riloff, Ellen; Mann, Gideon S.; Phillips, William

doi:10.1007/978-1-4020-4746-6_17

Part of the book series: Text, Speech and Language Technology (TLTB, volume 32)

Researchers have begun to investigate the use of statistical and machine learning methods for question answering. These techniques require training data, usually in the form of question/answer sets. In this chapter, we describe a reverse-engineering procedure that can be used to generate question/answer sets automatically from ordinary text corpora. Our technique identifies sentences that are good candidates for question/answer extraction, extracts the portions of the sentence corresponding to the question and the answer, and then transforms the information into an actual question and answer. Using this procedure, a collection of questions and answers can be automatically generated from any text corpus. One key benefit of this automatic procedure is that question/answer sets can be easily generated from domain-specific corpora, creating training data which could be used to build a Q/A system tailored for a specific domain.
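The three steps named above (select a candidate sentence, extract the question and answer portions, transform them into a question and an answer) can be illustrated with a minimal sketch. This is only a toy under stated assumptions, not the chapter's actual procedure, which is not reproduced here: the single regular-expression pattern, the `QAPair` type, and the function names `extract_qa` and `build_collection` are all hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class QAPair(NamedTuple):
    """One generated question/answer pair plus the sentence it came from."""
    question: str
    answer: str
    source: str


# Hypothetical toy pattern (an assumption, not taken from the chapter):
# "<Capitalized Name> is/was the <role> of <entity>."
ROLE_PATTERN = re.compile(
    r"^(?P<name>[A-Z][\w.]*(?: [A-Z][\w.]*)*) "
    r"(?P<verb>is|was) the (?P<role>[a-z][a-z ]*?) of (?P<entity>[^.,;]+)\.$"
)


def extract_qa(sentence: str) -> Optional[QAPair]:
    """Return a QAPair if the sentence fits the toy pattern, else None."""
    text = sentence.strip()
    # Step 1: decide whether the sentence is a candidate at all.
    match = ROLE_PATTERN.match(text)
    if match is None:
        return None
    # Step 2: the name is the answer portion; the rest is the question portion.
    answer = match["name"]
    # Step 3: transform the question portion into an actual question.
    question = f"Who {match['verb']} the {match['role']} of {match['entity']}?"
    return QAPair(question, answer, text)


def build_collection(sentences: Iterable[str]) -> List[QAPair]:
    """Run the toy extractor over a corpus, keeping only sentences that match."""
    pairs = []
    for sentence in sentences:
        pair = extract_qa(sentence)
        if pair is not None:
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    corpus = [
        "Jane Doe is the president of Acme Corp.",
        "The market fell sharply on Tuesday.",
    ]
    for pair in build_collection(corpus):
        print(pair.question, "->", pair.answer)
```

Only the first sentence of the made-up corpus fits the pattern; the second is skipped. A real system would presumably need many more patterns or learned extractors, and running the same loop over a domain-specific corpus is what would yield domain-specific training data.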

Author information

Authors and Affiliations

University of Utah, 84112, Salt Lake City, UT, USA
Ellen Riloff & William Phillips
University of Massachusetts, 140 Governors Drive, 01003, Amherst, MA, USA
Gideon S. Mann

Editor information

Editors and Affiliations

State University of New York at Albany, 1400 Washington Avenue, 12222, Albany, NY, USA
Tomek Strzalkowski
University of Texas at Dallas, 75083, Richardson, TX, USA
Sanda M. Harabagiu

About this chapter

Cite this chapter

Riloff, E., Mann, G.S., Phillips, W. (2008). Reverse-Engineering Question/Answer Collections From Ordinary Text. In: Strzalkowski, T., Harabagiu, S.M. (eds) Advances in Open Domain Question Answering. Text, Speech and Language Technology, vol 32. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-4746-6_17

DOI: https://doi.org/10.1007/978-1-4020-4746-6_17
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-4744-2
Online ISBN: 978-1-4020-4746-6
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)