Abstract
We analyze a corpus of system-user dialogues in the Internet of Things domain. Our corpus is automatically, semi-automatically, and manually annotated with a variety of features at both the utterance level and the full dialogue level. The corpus also includes human ratings of dialogue quality collected via crowdsourcing. We calculate correlations between features and human ratings to identify which features are highly associated with human perceptions of dialogue quality in this domain. We also perform linear regression to derive a variety of dialogue quality evaluation functions. These evaluation functions are then applied to a held-out portion of our corpus, where they prove highly predictive of human ratings and outperform standard reward-based evaluation functions.
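The pipeline the abstract describes can be illustrated with a minimal sketch: correlate annotated dialogue-level features with crowdsourced quality ratings, fit a linear regression as an evaluation function, and test it on a held-out split. The feature names and synthetic data below are purely illustrative assumptions, not the paper's actual feature set or corpus.

```python
# Minimal sketch, assuming hypothetical features and synthetic data:
# (1) correlate features with human ratings, (2) fit a linear regression
# as a dialogue quality evaluation function, (3) check held-out predictiveness.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in corpus: one row per dialogue, columns are annotated features.
# These names are hypothetical, chosen only for the sketch.
feature_names = ["num_turns", "task_success", "num_misunderstandings"]
X = rng.random((200, len(feature_names)))

# Synthetic human ratings (1-5 scale), constructed to depend on two features
# so the demo produces a non-trivial result.
y = 3.0 + 1.5 * X[:, 1] - 1.0 * X[:, 2] + 0.3 * rng.standard_normal(200)

# Step 1: correlation of each feature with the human ratings.
for name, column in zip(feature_names, X.T):
    r, p = pearsonr(column, y)
    print(f"{name}: r={r:+.3f} (p={p:.3f})")

# Step 2: linear regression over the training split yields an
# evaluation function: a weighted sum of features predicting quality.
train, test = slice(0, 150), slice(150, 200)
model = LinearRegression().fit(X[train], y[train])

# Step 3: how well does the learned function predict held-out ratings?
predicted = model.predict(X[test])
r_held_out, _ = pearsonr(predicted, y[test])
print(f"held-out correlation with human ratings: {r_held_out:.3f}")
```

In this setup the learned weights play the role the paper assigns to its derived evaluation functions: once fitted, they score new dialogues from annotated features alone, with no further human rating required.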
Acknowledgements
This work was funded by Samsung Electronics Co., Ltd. Some of the authors were partly supported by the U.S. Army Research Laboratory. Any statements or opinions expressed in this material are those of the authors and do not necessarily reflect the policy of the U.S. Government, and no official endorsement should be inferred.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Georgila, K., Gordon, C., Choi, H., Boberg, J., Jeon, H., Traum, D. (2019). Toward Low-Cost Automated Evaluation Metrics for Internet of Things Dialogues. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_14
DOI: https://doi.org/10.1007/978-981-13-9443-0_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9442-3
Online ISBN: 978-981-13-9443-0