An Interactive Framework for Learning Continuous Actions Policies Based on Corrective Feedback

Abstract

The main goal of this article is to present COACH (COrrective Advice Communicated by Humans), a new learning framework that allows non-expert humans to advise an agent while it interacts with the environment in continuous action problems. The human feedback is given in the action domain as binary corrective signals (increase/decrease the current action magnitude), and COACH adaptively adjusts the amount of correction that a given action receives, taking state-dependent past feedback into consideration. COACH also manages the credit assignment problem that normally arises when actions in continuous time receive delayed corrections. The proposed framework is characterized and validated extensively using four well-known learning problems. The experimental analysis includes comparisons with other interactive learning frameworks, with classical reinforcement learning approaches, and with human teleoperators trying to solve the same learning problems by themselves. In all the reported experiments, COACH outperforms the other methods in terms of learning speed and final performance. COACH has also been applied successfully to a complex real-world learning problem: ball dribbling by humanoid soccer players.

Acknowledgements

This work was partially funded by FONDECYT project 1161500 and CONICYT-PCHA/Doctorado Nacional/2015-21151488.

Author information

Corresponding author

Correspondence to Carlos Celemin.

Appendix

Given that human feedback is a key component of the proposed learning framework, a new Hand-Gesture Recognition (HGR) interface is proposed for providing feedback to the agent. The interface detects five gestures: positive correction, negative correction, a neutral gesture used when users do not need to provide feedback, a reward, and a punishment (see Fig. 15).

Fig. 15 Examples of recognized hand gestures

To make the proposed system robust to variations in illumination, colors, and non-uniform backgrounds, it uses: (i) Background Subtraction (BS) based on Gaussian Mixture Models (GMM) to detect regions of interest (ROI), i.e. hand candidates, (ii) Kalman filtering to track the hand candidates, (iii) Local Binary Patterns (LBP) as features for characterizing the ROIs, and (iv) SVM classifiers for the final detection of the hand gestures. The block diagram is shown in Fig. 16. The main functionalities are described in the following paragraphs:

  • Detection of Regions of Interest (ROI): Movement blobs are first detected using background subtraction. Then, adjacent blobs are merged and filtered using morphological filters, and the largest blob is selected as a hand candidate and fed to the tracking system.

    In parallel, a second process applies BS to color edges. First, a binary edge image is computed, and then color information is incorporated into the edges. Afterwards, BS and area filtering are applied in the edge domain. Finally, the output of the area-filtering module is intersected with the color edges in the block “&”. In order to manage occlusions properly (see Fig. 16b), the block “&” deletes the blobs associated with the occluded edges, which BS labels as regions with movement (Fig. 17 left) even though those edges are not present in the original image. The output is a blob containing the detected moving color edges (Fig. 17 right).

  • Tracking: The bounding-box parameters of the largest blob, taken as a hand candidate by the prior module, are used as observations by a Kalman filter, which estimates the final hand candidate by fusing the current ROI information with the prior estimates. Afterwards, the image computed in the block “&” of the previous module is intersected with the Kalman-filtered bounding box. Examples of the resulting images are shown in Fig. 15.

  • Feature Extraction and Classification: The image window given by the Tracking module is analyzed in order to classify the captured gesture. Histograms of LBP features are computed inside the image window. Since this window is a binary image, the LBP act as discretized measurements of the gradient, so the LBP histograms resemble Histograms of Oriented Gradients (HOG). The resulting feature vector feeds five SVM classifiers, one trained for each gesture, which perform the final gesture detection.
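The LBP histogram step described above can be illustrated with a minimal sketch. The text does not specify the LBP variant (neighbourhood size, radius, or use of uniform patterns), so the basic 3×3, 8-neighbour operator is assumed here, and the function name `lbp_histogram` and the toy `window` are hypothetical.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 8-neighbour LBP codes.

    `image` is a 2D list of ints (e.g. a binary window with values 0/1).
    Border pixels are skipped because they lack a full 3x3 neighbourhood.
    """
    # Neighbour offsets, clockwise from the top-left pixel (an assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# Toy binary window: a 2x2 block of ones surrounded by zeros.
window = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
hist = lbp_histogram(window)
```

On a binary window each code records which neighbours lie on the same side of an edge as the center pixel, which is why the histogram behaves like a coarse gradient-orientation descriptor. In a full pipeline this vector would be passed to the per-gesture classifiers.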

Fig. 16 Hand gesture recognition system. a General scheme, b Detailed scheme

Fig. 17 Hands occluding edges in the edge domain (left); result of the intersection “&” module (right)

The dataset used for training the SVM classifiers was built from images generated by the tracking module. Altogether, 1654 images of the five hand gestures were recorded; 60% of them were used for training and 40% for validation. The classification error is 9.05%, which is considered low enough for the interface to be used in the learning problems described in Section 4.
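As a rough illustration of the 60/40 split, the following sketch partitions a dataset of 1654 items and computes an error rate. How the split was actually drawn (random or sequential, stratified or not) is not stated, so the seeded shuffle and the helper names `split_dataset` and `error_rate` are assumptions made for this example only.

```python
import random


def split_dataset(items, train_fraction=0.6, seed=0):
    """Shuffle items reproducibly and split them into (training, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle: an assumption
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Image indices stand in for the 1654 recorded gesture images.
train, validation = split_dataset(range(1654))
print(len(train), len(validation))
```

With these assumptions the split yields 992 training and 662 validation images; the exact counts in the original experiment may differ by rounding.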

About this article

Cite this article

Celemin, C., Ruiz-del-Solar, J. An Interactive Framework for Learning Continuous Actions Policies Based on Corrective Feedback. J Intell Robot Syst 95, 77–97 (2019). https://doi.org/10.1007/s10846-018-0839-z

Keywords

  • Learning from demonstration
  • Interactive machine learning
  • Human feedback
  • Human teachers
  • Decision making systems