Abstract: In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
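The policy iteration scheme described in this abstract can be illustrated with a minimal hard-aggregation sketch. The function below is a sketch under stated assumptions, not the paper's implementation: it assumes a generic finite MDP given by a transition tensor P, a cost tensor g, a discount factor alpha, and a precomputed feature map phi that assigns each state to one of n_agg aggregate states (in the approach discussed in the paper, such features would be constructed by a deep neural network or other calculations). All names and the uniform disaggregation choice are illustrative.

```python
import numpy as np

def aggregation_policy_iteration_step(P, g, alpha, phi, n_agg, mu):
    """
    One approximate policy-iteration step using hard (feature-based) aggregation.

    P[u, i, j] : transition probability i -> j under control u
    g[u, i, j] : one-stage cost of that transition
    alpha      : discount factor in (0, 1)
    phi[i]     : aggregate state (feature value) assigned to original state i
    n_agg      : number of aggregate states
    mu[i]      : current policy, a control index for each original state
    """
    n_ctrl, n_states, _ = P.shape
    idx = np.arange(n_states)

    # Transition matrix and expected one-stage cost of the original chain under mu.
    P_mu = P[mu, idx, :]                               # shape (n_states, n_states)
    g_mu = np.einsum('ij,ij->i', P_mu, g[mu, idx, :])  # shape (n_states,)

    # Disaggregation matrix d: spread each aggregate state uniformly over its members
    # (an illustrative choice; other disaggregation probabilities are possible).
    d = np.zeros((n_agg, n_states))
    for x in range(n_agg):
        members = np.flatnonzero(phi == x)
        d[x, members] = 1.0 / len(members)

    # Aggregation matrix Phi: original state j belongs to aggregate state phi[j].
    Phi = np.zeros((n_states, n_agg))
    Phi[idx, phi] = 1.0

    # Policy evaluation on the aggregate problem: solve r = d g_mu + alpha (d P_mu Phi) r.
    P_agg = d @ P_mu @ Phi
    g_agg = d @ g_mu
    r = np.linalg.solve(np.eye(n_agg) - alpha * P_agg, g_agg)

    # Nonlinear (piecewise-constant) approximation of the policy's cost over the
    # original states, followed by a policy improvement step on the original problem.
    J_tilde = r[phi]
    Q = np.einsum('uij,uij->ui', P, g) + alpha * P @ J_tilde  # Q[u, i]
    return np.argmin(Q, axis=0), J_tilde
```

Repeating this step, with the feature map phi possibly re-estimated between iterations, gives one possible reading of the approximate policy iteration scheme that the abstract outlines.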
Funding: The project is supported by the National Natural Science Foundation of China.
Abstract: In this paper, we discuss Markovian decision programming with recursive vector reward and give an algorithm to find optimal policies. We prove that: (1) there is a Markovian optimal policy for the nonstationary case; (2) there is a stationary optimal policy for the stationary case.
Funding: Project supported by the National Natural Science Foundation of China.
Abstract: White and Furukawa have discussed vector-valued Markovian decision programming (VMDP). The relations between the finite-horizon and infinite-horizon cases of VMDP were discussed in [1]. Furukawa generalized the iteration algorithm from the scalar case to the vector case, and gave a method to find all optimal policies. His algorithm is described briefly in the following way: starting with any stationary policy, we