Abstract
We analyze a corpus of system-user dialogues in the Internet of Things domain. Our corpus is automatically, semi-automatically, and manually annotated with a variety of features at both the utterance level and the full dialogue level. The corpus also includes human ratings of dialogue quality collected via crowdsourcing. We calculate correlations between features and human ratings to identify which features are highly associated with human perceptions of dialogue quality in this domain. We also perform linear regression to derive a variety of dialogue quality evaluation functions. These evaluation functions are then applied to a held-out portion of our corpus, where they prove highly predictive of human ratings and outperform standard reward-based evaluation functions.
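The pipeline the abstract describes can be illustrated with a minimal sketch: correlate annotated dialogue-level features with crowdsourced quality ratings, fit a linear regression as an evaluation function, and test it on a held-out split. The feature names and synthetic data below are purely illustrative assumptions, not the paper's actual feature set or corpus.

```python
# Minimal sketch, assuming hypothetical features and synthetic data:
# (1) correlate features with human ratings, (2) fit a linear regression
# as a dialogue quality evaluation function, (3) check held-out predictiveness.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in corpus: one row per dialogue, columns are annotated features.
# These names are hypothetical, chosen only for the sketch.
feature_names = ["num_turns", "task_success", "num_misunderstandings"]
X = rng.random((200, len(feature_names)))

# Synthetic human ratings (1-5 scale), constructed to depend on two features
# so the demo produces a non-trivial result.
y = 3.0 + 1.5 * X[:, 1] - 1.0 * X[:, 2] + 0.3 * rng.standard_normal(200)

# Step 1: correlation of each feature with the human ratings.
for name, column in zip(feature_names, X.T):
    r, p = pearsonr(column, y)
    print(f"{name}: r={r:+.3f} (p={p:.3f})")

# Step 2: linear regression over the training split yields an
# evaluation function: a weighted sum of features predicting quality.
train, test = slice(0, 150), slice(150, 200)
model = LinearRegression().fit(X[train], y[train])

# Step 3: how well does the learned function predict held-out ratings?
predicted = model.predict(X[test])
r_held_out, _ = pearsonr(predicted, y[test])
print(f"held-out correlation with human ratings: {r_held_out:.3f}")
```

In this setup the learned weights play the role the paper assigns to its derived evaluation functions: once fitted, they score new dialogues from annotated features alone, with no further human rating required.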
Acknowledgements
This work was funded by Samsung Electronics Co., Ltd. Some of the authors were partly supported by the U.S. Army Research Laboratory. Any statements or opinions expressed in this material are those of the authors and do not necessarily reflect the policy of the U.S. Government, and no official endorsement should be inferred.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Georgila, K., Gordon, C., Choi, H., Boberg, J., Jeon, H., Traum, D. (2019). Toward Low-Cost Automated Evaluation Metrics for Internet of Things Dialogues. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_14
DOI: https://doi.org/10.1007/978-981-13-9443-0_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9442-3
Online ISBN: 978-981-13-9443-0