From Eliza to XiaoIce: challenges and opportunities with social chatbots

Review

Abstract

Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, we have seen progress from Eliza and Parry in the 1960s and 1970s, to task-completion systems as in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today’s social chatbots like XiaoIce. Social chatbots’ appeal lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish an emotional connection with users. The latter is done by satisfying users’ need for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; as such, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever living with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so they will become ubiquitous and help society as a whole.

Keywords

Conversational system; Social chatbot; Intelligent personal assistant; Artificial intelligence; XiaoIce

CLC number

TP391 

Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Corporation, Redmond, USA
