Abstract
Once a suitable definition of the system’s belief state is found, the system designer must define how actions are to be taken. The policy, denoted by \(\pi \), is the component that decides which action to take. Section 2.3 gave a brief overview of established techniques for hand-crafting these decisions. This chapter discusses algorithms that can be used to automate the decision-making process.
Notes
1. In the partially observable case \(b\) will be a probability distribution.
2. This is known to be finite because the system is episodic.
3. This approach is no less general than defining arbitrary basis functions, since the set of features can always be defined to include any value that is desired in a particular basis function.
4. Note that this approach to deciding the number of matching venues is inefficient when the database is large. An alternative approach is discussed in Chap. 7.
5. The TownInfo system has \(N_a=28\) and \(N_c = 10\); the total number of parameters would therefore be \(28\times (7\times 10+ 4) = 2072\).
6. The number of parameters for the inform summary act is unchanged at 74. The number of other parameters is \(7\times 9 + 4 = 67\). The number of remaining parameters for the request, select and confirm summary acts is \(27\times 7 = 189\). The total is therefore 330.
7. The occupancy frequency is also sometimes called the state distribution (Peters et al. 2005).
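The parameter counts in notes 5 and 6 can be checked with a few lines of arithmetic. The following sketch reproduces those two calculations; the grouping of terms into per-act contributions is an assumption based on the breakdown given in the notes.

```python
# Parameter counts for the TownInfo system (values from notes 5 and 6).
N_a = 28   # number of summary acts
N_c = 10   # number of concepts

# Note 5: each summary act contributes 7*N_c + 4 parameters.
per_act = 7 * N_c + 4            # 74 parameters per summary act
total_full = N_a * per_act       # full parameterisation: 2072

# Note 6: reduced parameterisation.
inform = 7 * N_c + 4             # inform summary act, unchanged: 74
other = 7 * 9 + 4                # other parameters: 67
remaining = 27 * 7               # request/select/confirm acts: 189
total_reduced = inform + other + remaining  # 330

print(total_full, total_reduced)  # 2072 330
```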
References
Amari S (1998) Natural gradient works efficiently in learning. Neural Comput 10:251–276
Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Mach Learn 22(1–3):33–57. ISSN 0885-6125
Peters J, Vijayakumar S, Schaal S (2005) Natural actor-critic. In: Proceedings of ECML. Springer, Heidelberg, pp 280–291
Schatzmann J (2008) Statistical user modeling for dialogue systems. Ph.D. thesis, University of Cambridge
Schatzmann J, Thomson B, Weilhammer K, Ye H, Young S (2007) Agenda-based user simulation for bootstrapping a POMDP dialogue system. In: Proceedings of HLT/NAACL
Sutton R, Barto A (1998) Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, Cambridge
Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: NIPS 12. MIT Press, Cambridge, pp 1057–1063
Williams JD, Young S (2005) Scaling up POMDPs for dialog management: the “Summary POMDP” method. In: Proceedings of ASRU, pp 177–182
© 2013 Springer-Verlag London
Cite this chapter
Thomson, B. (2013). Policy Design. In: Statistical Methods for Spoken Dialogue Management. Springer Theses. Springer, London. https://doi.org/10.1007/978-1-4471-4923-1_5
Print ISBN: 978-1-4471-4922-4
Online ISBN: 978-1-4471-4923-1