
Assessment and Evaluation of Speech-Based Interactive Systems: From Manual Annotation to Automatic Usability Evaluation

  • Sebastian Möller
Chapter

Abstract

Due to the improvements in speech and language technologies over the last few decades, the demand for assessing and evaluating such technologies has increased significantly. Starting from the assessment of individual system components, such as automatic speech recognition (ASR) or text-to-speech (TTS) synthesis, evaluation methods are now required to address the system (and the service based on it) as a whole. Both individual component assessment and whole-system evaluation are worth considering here, depending on the question that the assessment or evaluation is meant to answer.
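
To make the component-level case concrete: the standard metric for ASR assessment is the word error rate (WER), obtained by aligning the recognizer hypothesis word by word against a reference transcription and dividing the number of substitutions, insertions, and deletions by the reference length. A minimal sketch in Python (the example utterances are invented for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[-1][-1] / len(ref)

# Hypothetical example: one deletion ("me") and one substitution
# ("boston" -> "austin") over a five-word reference give WER = 0.4.
print(word_error_rate("show me flights to boston",
                      "show flights to austin"))
```

Whole-system evaluation, by contrast, must additionally capture interaction-level and perceptual measures that such component scores cannot reflect.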

Keywords

Interaction parameter, speech recognition, speech signal, automatic speech recognition, semantic concept

Acknowledgments

The work described in this chapter has partially been carried out at the Institute of Communication Acoustics, Ruhr-University Bochum, and partially at Deutsche Telekom Laboratories, Technische Universität Berlin. The author would like to thank all colleagues and students who contributed to the mentioned work, as well as Robert Schleicher and Klaus-Peter Engelbrecht for their comments on an earlier version of the chapter.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Deutsche Telekom Laboratories, Technische Universität Berlin, Germany
