Abstract
Anyone who has seen more than three science-fiction films will probably be able to suggest potential uses for spoken dialogue systems in space. Until recently, however, NASA and other space agencies have shown a surprising lack of interest in attempting to make this dream a reality and it is only in the last few years that any serious work has been carried out. The present chapter describes Clarissa, an experimental voice-enabled system developed at NASA Ames Research Center during a 3-year project starting in early 2002, which enables astronauts to navigate complex procedures using only spoken input and output. Clarissa was successfully tested on the International Space Station (ISS) on June 27, 2005, and is, to the best of our knowledge, the first spoken dialogue application in space.
Notes
1. In this connection, we would particularly like to mention T.J. Creamer and Mike Fincke.
2. Note that the grammar’s “logical forms” and the dialogue manager’s “dialogue moves” are not the same.
3. For long side-conversations, the user has the option of using the “suspend” command (cf. Section 2.1) to pause recognition.
Acknowledgments
Work at ICSI, UCSC and RIACS was supported by NASA Ames Research Center internal funding. Work at XRCE was partly supported by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. Several people not credited here as co-authors also contributed to the implementation of the Clarissa system: among these, we would particularly like to mention John Dowding, Susana Early, Claire Castillo, Amy Fischer and Vladimir Tkachenko. This publication only reflects the authors’ views.
Appendix: Detailed Results for System Performance
This appendix provides detailed performance results justifying the claims made in the main body of the chapter. We divide it into two parts: the first concerns the recognition task and the second the accept/reject task.
12.1.1 The Recognition Task
Table 12.7 presents the results of experiments contrasting speech understanding performance for the Regulus-based recogniser and the class N-gram recogniser, using several different sets of Alterf features (cf. Section 4.3). For completeness, we also present results for simulated perfect recognition, i.e. using the reference transcriptions. We used six different sets of Alterf features:
- N-grams: N-gram features only.
- LF: Logical-form-based patterns only.
- String: String-based patterns only.
- String + LF: Both string-based and logical-form-based patterns.
- String + N-grams: Both string-based and N-gram features.
- String + LF + N-grams: All types of features.
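To make the distinction between feature types concrete, the following is a minimal sketch of what string-based phrase-spotting might look like; the patterns and dialogue-move names here are hypothetical examples, not Clarissa's actual Alterf patterns.

```python
import re

# Hypothetical string-based patterns mapping surface phrases to dialogue moves
# (illustrative only; not the patterns used in the Clarissa system)
STRING_PATTERNS = [
    (re.compile(r"\bgo to step (\d+)\b"), "goto"),
    (re.compile(r"\bnext step\b"), "next"),
    (re.compile(r"\bset (?:the )?alarm\b"), "set_alarm"),
]

def spot_moves(utterance):
    """Phrase-spotting: return the dialogue-move features fired by
    string-based patterns on the recognised utterance."""
    moves = []
    for pattern, move in STRING_PATTERNS:
        if pattern.search(utterance.lower()):
            moves.append(move)
    return moves
```

In a combined configuration such as String + LF, features of this kind would simply be pooled with the logical-form-based ones before classification.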
The general significance of these results is discussed at the end of Section 12.4. It is interesting to note that the combination of logical-form-based and string-based features outperforms logical-form-based features alone (rows G-4 and G-2). Although the difference is small (6.0% versus 6.3%), a pairwise comparison shows that it is significant at the 1% level according to the McNemar sign test. The N-gram features, by contrast, show no clear benefit. This supports the standard folklore result that semantic understanding components for command and control applications are better implemented using hand-coded phrase-spotting patterns than general associational learning techniques.
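The pairwise comparison mentioned above can be sketched as follows: the McNemar sign test looks only at the utterances on which the two systems disagree, and asks whether the split between them could plausibly be a 50/50 coin flip. This is a generic implementation, not the chapter's evaluation code.

```python
from math import comb

def mcnemar_sign_test(errors_a, errors_b):
    """Exact two-sided McNemar sign test on paired per-utterance error
    indicators (True = that system got the utterance wrong).
    Only the discordant pairs contribute to the statistic."""
    b = sum(1 for ea, eb in zip(errors_a, errors_b) if ea and not eb)
    c = sum(1 for ea, eb in zip(errors_a, errors_b) if eb and not ea)
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Exact binomial tail with p = 0.5, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A p-value below 0.01 on such paired error lists is what "significant at the 1% level" means in the comparison above.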
Table 12.8 presents a breakdown of speech understanding performance, by utterance length, for the best GLM-based and SLM-based versions of the system. There are two main points to note here. First, speech understanding performance remains respectable even for the longer utterances; second, the performance of the GLM-based version is consistently better than that of the SLM-based version for all utterance lengths.
12.1.2 The Accept/Reject Task
Table 12.9 presents detailed results for the experiments on response filtering described in Section 12.5. All conclusions were confirmed by hypothesis testing, using the Wilcoxon rank test, at the 5% significance level. In the remainder of this section, we assess the impact made by individual techniques.
12.1.3 Kernel Types
Quadratic kernels performed better than linear ones (around 25% relative improvement in classification error); the advantage is less marked on the task metric (only 3–9% relative improvement). Though small, the difference is statistically significant. This suggests that information useful for filtering lies, at least partially, in co-occurrences of groups of words rather than in isolated words alone.
12.1.4 Asymmetric Error Costs
We next consider the effect of methods designed to take account of asymmetric error costs (cf. Section 12.5). Comparing GQ-1 (no treatment of asymmetric error costs) with GQ-2 (intrinsic SVM optimisation using the j-parameter) and GQ-3 (calibration), we see that both methods produce a significant improvement in performance. On the u2 loss function that both methods aim to minimise, we obtain a 9% relative improvement using calibration and 6% using intrinsic SVM optimisation; on the task metric, these gains are reduced to 5% (relative) for calibration and only 2% for intrinsic SVM optimisation, though both are still statistically significant. Error rates on individual classes show that, as intended, both methods shift errors from false accepts (classes B and C) to the less dangerous false rejects (class A). In particular, calibration reduces the false accept rate on cross-talk and out-of-domain utterances from 6.8% on GQ-1 to 4.7% on GQ-3 (31% relative), at the cost of an increase from 2.7% to 4.3% in the false reject rate for correctly recognised utterances.
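The logic behind the calibration method can be sketched as a cost-sensitive decision rule: once classifier scores are calibrated into probabilities, the Bayes-optimal accept/reject boundary follows directly from the ratio of the two error costs. The cost values below are illustrative placeholders, not the chapter's actual task-metric weights.

```python
def accept_threshold(cost_false_accept, cost_false_reject):
    """Bayes-optimal acceptance threshold on a calibrated probability
    that the recognition result is correct: accept exactly when the
    expected cost of accepting is below that of rejecting."""
    return cost_false_accept / (cost_false_accept + cost_false_reject)

def filter_response(p_correct, cost_false_accept=2.0, cost_false_reject=1.0):
    # Hypothetical costs: a false accept is twice as harmful as a false reject,
    # so the threshold moves above 0.5 and more utterances are rejected.
    return p_correct > accept_threshold(cost_false_accept, cost_false_reject)
```

Raising the false-accept cost raises the threshold, which is precisely the trade observed above: fewer false accepts on cross-talk at the price of more false rejects on correctly recognised utterances.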
12.1.5 Recognition Methods
Using the confidence threshold method, there was a large difference in performance between the GLM-based GT-1 and the SLM-based ST-1. In particular, the false accept rate for cross-talk and out-of-domain utterances is nearly twice as high (16.5% versus 8.9%) for the SLM-based recogniser. This supports the folklore result that GLM-based recognisers give better performance on the accept/reject task.
When using the SVM-based methods, however, the best GLM-based configuration (GQ-3) performs about as well as the best SLM-based configuration (SQ-1) in terms of average classification error, with both systems scoring about 5.5%. GQ-3 does perform considerably better than SQ-1 in terms of task error (5.4% versus 6.9%, or 21% relative), but this is due to better performance on the speech recognition and semantic interpretation tasks. Our conclusion here is that GLM-based recognisers do not necessarily offer superior performance to SLM-based ones on the accept/reject task, if a more sophisticated method than a simple confidence threshold is used.
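The contrast between the two filtering approaches can be sketched as follows: the baseline thresholds a single recogniser confidence score, while the SVM-based methods combine several signals into one decision function. The feature set, weights, and threshold here are hypothetical, standing in for what training would produce.

```python
def confidence_filter(confidence, threshold=0.85):
    """Baseline: accept purely on a recogniser confidence score."""
    return confidence >= threshold

def svm_style_filter(features, weights, bias):
    """Sketch of the richer accept/reject decision: a linear decision
    function over multiple features (e.g. confidence, utterance length,
    an in-grammar flag), as a trained SVM would supply. Weights and bias
    are hypothetical; a real system learns them from labelled data."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0
```

Because the decision surface depends on more than raw confidence, an SLM-based recogniser with weaker confidence scores can still achieve competitive filtering, which matches the conclusion above.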
Copyright information

© 2010 Springer Science+Business Media, LLC

Cite this chapter

Rayner, M., Hockey, B.A., Renders, J.M., Chatzichrisafis, N., Farrell, K. (2010). Spoken Dialogue Application in Space: The Clarissa Procedure Browser. In: Chen, F., Jokinen, K. (eds) Speech Technology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-73819-2_12

Print ISBN: 978-0-387-73818-5

Online ISBN: 978-0-387-73819-2