Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin


At present, the significance of humanoid robots has increased dramatically, yet such robots rarely enter daily human life because their development remains immature. The lip shape of a humanoid robot is crucial during speech, since it makes the robot look like a real human. Many studies show that vowels are the essential elements of pronunciation in all of the world's languages. Building on traditional viseme research, we raise the priority of smooth lip transitions between vowels and propose a lip matching scheme based on vowel priority. In addition, we design a similarity evaluation model based on the Manhattan distance over computer-vision lip features, which quantifies lip shape similarity on a 0–1 scale and provides an effective evaluation standard. This model compensates for the lack of lip shape similarity evaluation criteria in this field. We applied the lip matching scheme to the humanoid robot Ren-Xin and performed robot teaching experiments, as well as a similarity comparison experiment on 20 sentences spoken by two males, two females, and the robot. All experiments achieved excellent results.
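The core idea of the evaluation model, mapping the Manhattan (L1) distance between lip-feature vectors to a similarity score in the 0–1 range, can be sketched as follows. This is a minimal illustration only: the feature definition and the exact normalization used in the paper are not reproduced here, and the `1/(1+d)` mapping is an assumed choice that merely satisfies the stated 0–1 bound (identical lip shapes score 1, and the score decays toward 0 as the distance grows).

```python
import numpy as np

def lip_similarity(feat_a, feat_b):
    """Similarity between two lip-feature vectors based on the
    Manhattan (L1) distance.

    Returns a score in (0, 1]: 1.0 for identical feature vectors,
    decreasing monotonically as the L1 distance increases.
    (Illustrative mapping only; not the paper's exact formula.)
    """
    a = np.asarray(feat_a, dtype=float)
    b = np.asarray(feat_b, dtype=float)
    d = np.sum(np.abs(a - b))       # Manhattan (L1) distance
    return 1.0 / (1.0 + d)          # squash distance into (0, 1]
```

For example, comparing a feature vector with itself yields 1.0, while increasingly different lip shapes yield scores approaching 0, which matches the 0–1 quantification described in the abstract.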






This research has been partially supported by JSPS KAKENHI Grant no. 19K20345.

Author information



Corresponding author

Correspondence to Fuji Ren.


Cite this article

Liu, Z., Kang, X., Nishide, S. et al. Vowel priority lip matching scheme and similarity evaluation model based on humanoid robot Ren-Xin. J Ambient Intell Human Comput (2020). https://doi.org/10.1007/s12652-020-02175-9



Keywords

  • Lip matching
  • Vowel priority
  • Similarity evaluation
  • Humanoid robot
  • Manhattan distance