Toward Low-Cost Automated Evaluation Metrics for Internet of Things Dialogues

  • Conference paper

9th International Workshop on Spoken Dialogue System Technology

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 579)

Abstract

We analyze a corpus of system-user dialogues in the Internet of Things domain. Our corpus is automatically, semi-automatically, and manually annotated with a variety of features at both the utterance level and the full-dialogue level. The corpus also includes human ratings of dialogue quality collected via crowdsourcing. We calculate correlations between features and human ratings to identify which features are strongly associated with human perceptions of dialogue quality in this domain. We also perform linear regression and derive a variety of dialogue quality evaluation functions. These evaluation functions are then applied to a held-out portion of our corpus and shown to be highly predictive of human ratings, outperforming standard reward-based evaluation functions.
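As an illustration of the pipeline the abstract describes, the minimal sketch below correlates dialogue-level features with human quality ratings, fits a linear-regression evaluation function on a training portion of a corpus, and measures how predictive it is on a held-out portion. The synthetic data and the feature names (num_turns, num_errors, task_success) are hypothetical stand-ins; the paper's actual features and annotation pipeline are not reproduced here.

    # Hypothetical sketch: correlate dialogue features with human quality
    # ratings, then fit and test a linear evaluation function.
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Toy stand-in corpus: one row per dialogue, three made-up
    # dialogue-level features (e.g. turn count, error count, task success).
    n = 200
    X = rng.normal(size=(n, 3))
    y = X @ np.array([-0.4, -0.8, 1.2]) + rng.normal(0.0, 0.3, n)  # "human ratings"

    # Step 1: correlate each feature with the ratings to find strong predictors.
    for j, name in enumerate(["num_turns", "num_errors", "task_success"]):
        r, p = pearsonr(X[:, j], y)
        print(f"{name}: r = {r:+.2f} (p = {p:.3g})")

    # Step 2: fit a linear evaluation function on 80% of the dialogues ...
    split = int(0.8 * n)
    model = LinearRegression().fit(X[:split], y[:split])

    # Step 3: ... and check how well it predicts ratings on the held-out 20%.
    r_held, _ = pearsonr(model.predict(X[split:]), y[split:])
    print(f"held-out correlation with human ratings: r = {r_held:+.2f}")

Under this setup, the regression coefficients themselves form the evaluation function (quality ≈ w · features + b), and the held-out correlation is the kind of predictiveness check the abstract reports against reward-based baselines.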

Acknowledgements

This work was funded by Samsung Electronics Co., Ltd. Some of the authors were partly supported by the U.S. Army Research Laboratory. Any statements or opinions expressed in this material are those of the authors and do not necessarily reflect the policy of the U.S. Government, and no official endorsement should be inferred.

Author information

Correspondence to Kallirroi Georgila.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Georgila, K., Gordon, C., Choi, H., Boberg, J., Jeon, H., Traum, D. (2019). Toward Low-Cost Automated Evaluation Metrics for Internet of Things Dialogues. In: D'Haro, L., Banchs, R., Li, H. (eds) 9th International Workshop on Spoken Dialogue System Technology. Lecture Notes in Electrical Engineering, vol 579. Springer, Singapore. https://doi.org/10.1007/978-981-13-9443-0_14
