A novel focus encoding scheme for addressee detection in multiparty interaction using machine learning algorithms

Abstract

Addressee detection is a fundamental task for seamless dialogue management and turn-taking in human-agent interaction. Although the addressee is implicit in dyadic interaction, detecting it becomes challenging when more than two participants are involved. This article proposes multiple addressee detection models based on smart feature selection and focus encoding schemes. The models are trained using different machine learning and deep learning algorithms, and they improve on existing baseline accuracies for addressee prediction on two datasets. In addition, the article explores the impact of different focus encoding schemes in several addressee detection cases. Finally, a strategy for implementing the addressee detection model in real time is discussed.

Notes

  1. The annotation is available at: https://doi.org/10.6084/m9.figshare.13297775.

  2. http://agent.roboslang.org.

Author information

Corresponding author

Correspondence to Usman Malik.

Additional information

This work was supported by the DAISI project, cofunded by the European Union with the European Regional Development Fund (ERDF), by the French Agence Nationale de la Recherche and by the Regional Council of Normandie.

Appendix: Classifiers and parameters for experimentation

| Classifier | AMI parameters | MULTISIMO parameters |
| --- | --- | --- |
| XGB | learning_rate=0.1, n_estimators=140, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, objective='multi:softmax', nthread=4, scale_pos_weight=1 | learning_rate=0.1, n_estimators=130, max_depth=3, min_child_weight=1, gamma=0, subsample=0.6, colsample_bytree=0.5, objective='multi:softmax', nthread=4, scale_pos_weight=1 |
| ET | bootstrap=True, criterion='gini', max_features='sqrt', n_estimators=1000 | bootstrap=True, criterion='entropy', max_features='sqrt', n_estimators=200 |
| ADB | base_estimator='DecisionTree', max_features=30, n_estimators=800 | base_estimator='DecisionTree', max_features=30, n_estimators=800 |
| MLP | activation='tanh', alpha=0.05, hidden_layer_sizes=(100,), learning_rate='adaptive', solver='adam' | activation='tanh', alpha=0.0001, hidden_layer_sizes=(50, 100, 50), learning_rate='constant', solver='sgd', max_iter=100 |
| RF | bootstrap=False, criterion='gini', max_features='auto', n_estimators=200 | bootstrap=True, criterion='gini', max_features='sqrt', n_estimators=100 |
| LR | penalty='l2', C=100 | penalty='l2', C=0.1 |
| SVM | C=100, gamma=0.01 | C=10, gamma=0.01 |
| NB | No parameters | No parameters |
| KNN | n_neighbors=8 | n_neighbors=9 |
| LSTM | hidden layer neurons=(100, 50), dropout=0.5, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=4, epochs=100, callbacks=early stopping, patience=20 | hidden layer neurons=(50, 25), dropout=0.2, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=1, epochs=100, callbacks=early stopping, patience=20 |
| Bi-LSTM | hidden layer neurons=(100, 50), dropout=0.5, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=4, epochs=100, callbacks=early stopping, patience=20 | hidden layer neurons=(50, 25), dropout=0.2, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=1, epochs=100, callbacks=early stopping, patience=20 |
| 1D-CNN | hidden layer neurons=(100, 50), kernel_size=(3, 3), dropout=0.5, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=4, epochs=100, callbacks=early stopping, patience=20 | hidden layer neurons=(50, 25), kernel_size=(3, 3), dropout=0.2, hidden_activation=relu, final_activation=softmax, loss=categorical_crossentropy, optimizer=adam, batch_size=1, epochs=100, callbacks=early stopping, patience=20 |
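
As a rough illustration of how some of the AMI settings listed above could be expressed in code, the following sketch instantiates five of the classical classifiers with scikit-learn and scores them by cross-validation. It is not the authors' implementation. The function names `build_ami_classifiers` and `mean_cv_accuracy`, the `random_state` seed, the synthetic four-class data, and the five-fold evaluation are all assumptions made for this example. The SVM kernel is not stated in the table, so the scikit-learn default (RBF) is assumed. The table's `max_features='auto'` for RF is written as `'sqrt'`, which is what `'auto'` meant for classifiers in older scikit-learn releases; newer releases no longer accept `'auto'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def build_ami_classifiers(random_state=0):
    """Return estimators configured with the AMI column of the appendix table.

    random_state is an assumption added for reproducibility; it is not in the table.
    """
    return {
        "ET": ExtraTreesClassifier(
            bootstrap=True, criterion="gini", max_features="sqrt",
            n_estimators=1000, random_state=random_state,
        ),
        # The table lists max_features='auto'; 'sqrt' is the equivalent
        # setting for classifiers in current scikit-learn releases.
        "RF": RandomForestClassifier(
            bootstrap=False, criterion="gini", max_features="sqrt",
            n_estimators=200, random_state=random_state,
        ),
        "MLP": MLPClassifier(
            activation="tanh", alpha=0.05, hidden_layer_sizes=(100,),
            learning_rate="adaptive", solver="adam", random_state=random_state,
        ),
        # Kernel is not given in the table; the default RBF kernel is assumed.
        "SVM": SVC(C=100, gamma=0.01),
        "KNN": KNeighborsClassifier(n_neighbors=8),
    }


def mean_cv_accuracy(classifiers, X, y, folds=5):
    """Return the mean cross-validated accuracy for each named classifier."""
    return {
        name: cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
        for name, clf in classifiers.items()
    }


if __name__ == "__main__":
    # Synthetic stand-in data; the real features come from the annotated corpora.
    X, y = make_classification(
        n_samples=300, n_features=20, n_informative=8, n_classes=4, random_state=0
    )
    for name, acc in mean_cv_accuracy(build_ami_classifiers(), X, y).items():
        print(f"{name}: {acc:.3f}")
```

Swapping in the MULTISIMO column only requires changing the keyword arguments; the accuracies printed on synthetic data say nothing about performance on the actual corpora.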

Cite this article

Malik, U., Barange, M., Saunier, J. et al. A novel focus encoding scheme for addressee detection in multiparty interaction using machine learning algorithms. J Multimodal User Interfaces (2021). https://doi.org/10.1007/s12193-020-00361-9

Keywords

  • Human computer interaction
  • Machine learning
  • Intelligent agents