Application Domain, Human Factors, and Dialogue

Robustness in Automatic Speech Recognition

Summary

After considering the important properties required by speech recognition applications, this chapter concentrates on the use of application domain knowledge, the user interface, and dialogue to improve the robustness of an application using speech. At the dialogue level, the emphasis is on multimodality, error correction, naturalness, and the need for different dialogue strategies depending on the application considered. To build application prototypes and reduce development costs, it is important to develop a flexible technology that can be adapted to various applications belonging to the same class. As important steps towards this goal, vocabulary-independent recognition, application-independent dialogue architectures, and the notion of a global speech interface are considered. Once an application prototype exists, it needs to be evaluated. This leads us to a description of recently developed assessment methodologies. We then present a robust real-world application, stressing the factors that contribute to its robustness. Finally, we briefly provide some perspectives on the most promising application domains.

Copyright information

© 1996 Kluwer Academic Publishers

About this chapter

Cite this chapter

Junqua, JC., Haton, JP. (1996). Application Domain, Human Factors, and Dialogue. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_13

  • DOI: https://doi.org/10.1007/978-1-4613-1297-0_13

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4612-8555-7

  • Online ISBN: 978-1-4613-1297-0

  • eBook Packages: Springer Book Archive
