New Learning Paradigms in Soft Computing, pp. 181–230

# Reinforcement Learning for Fuzzy Agents: Application to a Pighouse Environment Control

## Abstract

Fuzzy Actor-Critic Learning (FACL) and Fuzzy Q-Learning (FQL) are reinforcement learning methods based on Dynamic Programming (DP) principles. In this chapter, they are used to tune online the conclusion parts of Fuzzy Inference Systems (FIS). The only information available for learning is the system feedback, which describes, in terms of rewards and punishments, the task the fuzzy agent has to accomplish. At each time step, the agent receives a reinforcement signal according to the last action it performed in the previous state. The problem involves optimizing not only the immediate reinforcement, but also the total amount of reinforcement the agent can receive in the future. To illustrate the use of these two learning methods, we first applied them to a problem in which a fuzzy controller must drive a boat from one bank of a river to the other, across a strong non-linear current. We then used the well-known Cart-Pole Balancing and Mountain-Car problems to compare our methods with other reinforcement learning methods, and to highlight important characteristics of FACL and FQL. The experimental studies showed the superiority of these methods over the related methods we found in the literature. We also found that these generic methods can address any kind of reinforcement learning problem (continuous states, discrete or continuous actions, various types of reinforcement functions). Thanks to this flexibility, they have been applied successfully to an industrial problem: learning a policy for pighouse environment control.
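To make the abstract's description concrete, the sketch below illustrates the Fuzzy Q-Learning idea: each fuzzy rule holds a q-vector over discrete candidate conclusions, the global action is the firing-strength-weighted sum of the per-rule choices, and the temporal-difference error updates each rule's selected conclusion in proportion to its firing strength. This is a minimal illustration under simplifying assumptions (a 1-D state, triangular membership functions, illustrative names), not the chapter's exact formulation.

```python
import random

def tri(x, a, b, c):
    """Triangular membership function over [a, c] with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Illustrative fuzzy partition of a 1-D state (NEG, ZERO, POS).
SETS = [(-1.0, -1.0, 0.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
ACTIONS = [-1.0, 0.0, 1.0]                 # discrete candidate conclusions per rule
q = [[0.0] * len(ACTIONS) for _ in SETS]   # one q-vector per fuzzy rule

def firing(x):
    """Normalized rule firing strengths for state x."""
    w = [tri(x, *s) for s in SETS]
    t = sum(w) or 1.0
    return [wi / t for wi in w]

def select(x, eps=0.1):
    """Per rule, pick a candidate conclusion (epsilon-greedy); the global
    action is the firing-strength-weighted sum of the local choices."""
    w = firing(x)
    choice = [random.randrange(len(ACTIONS)) if random.random() < eps
              else max(range(len(ACTIONS)), key=lambda j: q[i][j])
              for i in range(len(SETS))]
    a = sum(w[i] * ACTIONS[choice[i]] for i in range(len(SETS)))
    Q = sum(w[i] * q[i][choice[i]] for i in range(len(SETS)))
    return a, choice, w, Q

def update(choice, w, Q, r, x_next, alpha=0.1, gamma=0.95):
    """TD update of the selected conclusions, weighted by firing strength."""
    w2 = firing(x_next)
    q_star = sum(w2[i] * max(q[i]) for i in range(len(SETS)))
    delta = r + gamma * q_star - Q         # TD error on the fuzzy Q-value
    for i in range(len(SETS)):
        q[i][choice[i]] += alpha * delta * w[i]
```

In a control loop one would repeatedly call `select` to act, observe the reward and next state, then call `update`; only the conclusion parts (the q-vectors) are learned, while the membership functions stay fixed, matching the tuning scheme the abstract describes.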

## Keywords

Learning Rate, Reinforcement Learning, Fuzzy Controller, Fuzzy Inference System, Discrete Action
