Abstract
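Reinforcement learning agents solve sequential decision problems by learning and planning with optimal policies, so how to define the optimality criterion for a policy is one of the core questions in reinforcement learning research. This paper discusses a series of optimality criteria from dynamic programming and examines, through examples, the suitability, strengths, and weaknesses of each criterion for reinforcement learning.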
Reinforcement learning (RL) agents solve sequential decision problems by learning optimal policies for choosing actions. Thus, at the core of RL is the definition of what it means for a policy to be "optimal". In this paper, a variety of optimality criteria from the dynamic programming literature are discussed, and their suitability and characteristics for RL are examined through some examples. The necessity of devising RL algorithms for the various criteria is also analyzed.
Source
《计算机工程与科学》
CSCD
2001, No. 2, pp. 62-65 (4 pages)
Computer Engineering & Science