International Journal of Speech Technology, Volume 22, Issue 3, pp 697–709

Speech and language processing for assessing child–adult interaction based on diarization and location

  • John H. L. Hansen
  • Maryam Najafian
  • Rasa Lileikyte
  • Dwight Irvin
  • Beth Rous


Understanding and assessing child verbal communication patterns is critical to facilitating effective language development. Typically, speaker diarization is performed to explore children’s verbal engagement. Understanding which activity areas stimulate verbal communication can help promote more efficient language development. In this study, we present a two-stage child vocal engagement prediction system that consists of (1) a near real-time, noise-robust system that measures the duration of child-to-adult and child-to-child conversations and tracks the number of conversational turns, and (2) a novel child location tracking strategy that determines the activity areas in which a child spends the most and least time. The proposed child–adult turn-taking solution relies exclusively on vocal cues observed during the interaction between a child and other children and/or classroom teachers. By employing threshold-optimized speech activity detection based on a linear combination of voicing measures (TO-COMBO-SAD), it is possible to achieve effective speech/non-speech segment detection prior to conversation assessment. TO-COMBO-SAD reduces classification error rates for adult–child audio by 21.34% and 27.3% compared to baseline i-Vector and standard Bayesian Information Criterion diarization systems, respectively. In addition, this study presents a unique adult–child location tracking system that helps determine the quantity of child–adult communication in specific activity areas, and which activities stimulate vocal engagement in a child–adult education space. We observe that the proposed location tracking solution offers unique opportunities to assess speech and language interaction for children and to quantify the location context that could contribute to improved verbal communication.
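To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes diarization has already produced labeled speech segments and that the location system yields time-stamped (x, y) positions. The `Segment` type, the speaker labels, the `max_gap` pause threshold, and the rectangular activity areas are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds
    speaker: str   # hypothetical label, e.g. "child" or "adult"


def count_turns(segments, max_gap=2.0):
    """Count speaker changes that follow the previous segment within max_gap seconds.

    max_gap is an assumed pause limit; longer silences are treated as
    the end of a conversation rather than a turn.
    """
    ordered = sorted(segments, key=lambda s: s.start)
    turns = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.speaker != prev.speaker and cur.start - prev.end <= max_gap:
            turns += 1
    return turns


def speech_time_by_speaker(segments):
    """Total speech duration (seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def time_by_area(track, areas):
    """Time spent in each activity area.

    track: time-ordered list of (t, x, y) samples.
    areas: dict mapping area name to (xmin, ymin, xmax, ymax).
    Each position is held until the next sample; the last sample
    contributes no duration.
    """
    totals = defaultdict(float)
    for (t0, x, y), (t1, _, _) in zip(track, track[1:]):
        for name, (x0, y0, x1, y1) in areas.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                totals[name] += t1 - t0
                break
    return dict(totals)


segments = [
    Segment(0.0, 2.0, "child"),
    Segment(2.5, 4.0, "adult"),
    Segment(4.2, 5.0, "adult"),
    Segment(10.0, 11.0, "child"),
]
track = [(0, 1, 1), (10, 1, 1), (20, 5, 5), (30, 5, 5)]
areas = {"books": (0, 0, 2, 2), "blocks": (4, 4, 6, 6)}

turns = count_turns(segments)
speech = speech_time_by_speaker(segments)
dwell = time_by_area(track, areas)
```

Combining the per-area dwell times with the segment timestamps would allow turn counts to be broken down by activity area, which is the kind of location-conditioned measure the abstract describes.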


Child speech · Speaker diarization · Speech activity detection · I-Vector · Language environment monitoring



The authors wish to express their sincere thanks to the University of Kentucky for the joint collaboration on this study. In particular, we thank Ying Luo for collecting and organizing the child database used in this study.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Center for Robust Speech Systems, University of Texas at Dallas, Richardson, USA
  2. Life Span Institute, University of Kansas, Kansas City, USA
  3. College of Education, University of Kentucky, Lexington, USA
