From Eliza to XiaoIce: challenges and opportunities with social chatbots

  • Review
Frontiers of Information Technology & Electronic Engineering

Abstract

Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, we have seen progress from Eliza and Parry in the 1960s and 1970s, to task-completion systems as in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today’s social chatbots like XiaoIce. Social chatbots’ appeal lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish an emotional connection with users. The latter is done by satisfying users’ need for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; as such, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever living with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so they will become ubiquitous and help society as a whole.

Author information

Corresponding author

Correspondence to Heung-yeung Shum.

Cite this article

Shum, Hy., He, Xd. & Li, D. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers Inf Technol Electronic Eng 19, 10–26 (2018). https://doi.org/10.1631/FITEE.1700826

  • DOI: https://doi.org/10.1631/FITEE.1700826
