Abstract
Once a suitable definition of the system’s belief state is found, the system designer must define how actions are to be taken. The policy, denoted by \(\pi \), is the component that decides which action to take. Section 2.3 gave a brief overview of established techniques for hand-crafting these decisions. This chapter discusses algorithms that can be used to automate the decision-making process.
Notes
1. In the partially observable case \(b\) will be a probability distribution.
2. This is known to be finite because the system is episodic.
3. This approach is no less general than defining arbitrary basis functions, since the set of features can always be defined to include any value that is desired in a particular basis function.
4. Note that this approach to deciding the number of matching venues is inefficient when the database is large. An alternative approach is discussed in Chap. 7.
5. The TownInfo system has \(N_a=28\) and \(N_c = 10\); the total number of parameters would therefore be \(28\times (7\times 10+ 4) = 2072\).
6. The number of parameters for the inform summary act is unchanged at 74. The number of other parameters is \(7\times 9 + 4 = 67\). The number of remaining parameters for the request, select and confirm summary acts is \(27\times 7 = 189\). The total is therefore 330.
7. The occupancy frequency is also sometimes called the state distribution (Peters et al. 2005).
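The parameter counts in notes 5 and 6 can be checked with a few lines of arithmetic. The following sketch reproduces those two calculations; the grouping of terms into per-act contributions is an assumption based on the breakdown given in the notes.

```python
# Parameter counts for the TownInfo system (values from notes 5 and 6).
N_a = 28   # number of summary acts
N_c = 10   # number of concepts

# Note 5: each summary act contributes 7*N_c + 4 parameters.
per_act = 7 * N_c + 4            # 74 parameters per summary act
total_full = N_a * per_act       # full parameterisation: 2072

# Note 6: reduced parameterisation.
inform = 7 * N_c + 4             # inform summary act, unchanged: 74
other = 7 * 9 + 4                # other parameters: 67
remaining = 27 * 7               # request/select/confirm acts: 189
total_reduced = inform + other + remaining  # 330

print(total_full, total_reduced)  # 2072 330
```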
References
Amari S (1998) Natural gradient works efficiently in learning. Neural Comput 10:251–276
Bradtke SJ, Barto AG (1996) Linear least-squares algorithms for temporal difference learning. Mach Learn 22(1–3):33–57. ISSN 0885-6125
Peters J, Vijayakumar S, Schaal S (2005) Natural actor-critic. In: Proceedings of ECML. Springer, Heidelberg, pp 280–291
Schatzmann J (2008) Statistical user modeling for dialogue systems. Ph.D. thesis, University of Cambridge
Schatzmann J, Thomson B, Weilhammer K, Ye H, Young S (2007) Agenda-based user simulation for bootstrapping a POMDP dialogue system. In: Proceedings of HLT/NAACL
Sutton R, Barto A (1998) Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, Cambridge
Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: NIPS 12. MIT Press, Cambridge, pp 1057–1063
Williams JD, Young S (2005) Scaling up POMDPs for dialog management: the “Summary POMDP” method. In: Proceedings of ASRU, pp 177–182
© 2013 Springer-Verlag London
Cite this chapter
Thomson, B. (2013). Policy Design. In: Statistical Methods for Spoken Dialogue Management. Springer Theses. Springer, London. https://doi.org/10.1007/978-1-4471-4923-1_5
Print ISBN: 978-1-4471-4922-4
Online ISBN: 978-1-4471-4923-1