Funding: the National Natural Science Foundation of China (No. 61673265), the National Key Research and Development Program (No. 2020YFC1512203), and the Shanghai Commercial Aircraft System Engineering Joint Research Fund (No. CASEF-2022-Z05).
Abstract: Based on the option-critic algorithm, a new adversarial algorithm named deterministic policy network with option architecture is proposed to improve an agent's performance against an opponent using a fixed offensive algorithm. An option network is introduced at the upper level, which generates an activation signal choosing between defensive and offensive strategies according to the current situation. The lower-level executive layer then works out the interactive action under the guidance of the activation signal, and a critic structure evaluates the value of the activation signal and the interactive action jointly. This method effectively relaxes the semi-Markov decision process requirement and simplifies the network structure by eliminating the termination-probability layer. The experimental results show that the new algorithm switches neatly between offensive and defensive strategies and acquires more reward from the environment than the classical deep deterministic policy gradient algorithm does.
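The following is a minimal sketch of the architecture described in this abstract, not the authors' implementation: an upper-level option network emits an activation signal over a defensive and an offensive strategy, a lower-level executive layer outputs a deterministic interactive action conditioned on that signal, and a single critic evaluates the state, activation signal, and action jointly. All module names, layer sizes, and dimensions are illustrative assumptions.

```python
# Minimal sketch (assumed names/dimensions) of an option network + deterministic
# executive layer + joint critic, in the spirit of the abstract above.
import torch
import torch.nn as nn


class OptionNetwork(nn.Module):
    """Upper level: maps the current situation (state) to option logits."""
    def __init__(self, state_dim: int, num_options: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_options))

    def forward(self, state):
        return self.net(state)            # logits over {defensive, offensive}


class ExecutiveLayer(nn.Module):
    """Lower level: deterministic action from the state and the activation signal."""
    def __init__(self, state_dim: int, num_options: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + num_options, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, state, option_onehot):
        return self.net(torch.cat([state, option_onehot], dim=-1))


class Critic(nn.Module):
    """Evaluates the activation signal and the interactive action together."""
    def __init__(self, state_dim: int, num_options: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + num_options + action_dim, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, option_onehot, action):
        return self.net(torch.cat([state, option_onehot, action], dim=-1))


state = torch.randn(1, 8)                                      # toy state
logits = OptionNetwork(8)(state)
option = nn.functional.one_hot(logits.argmax(-1), 2).float()   # activation signal
action = ExecutiveLayer(8, 2, 3)(state, option)                # interactive action
q_value = Critic(8, 2, 3)(state, option, action)               # joint evaluation
```

In training, the critic's value estimate would drive a deterministic policy gradient for the executive layer and a policy-gradient-style update for the option network; no separate termination-probability layer appears in this sketch, consistent with the simplification mentioned in the abstract.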
Abstract: Temporal abstraction is an important research direction in hierarchical reinforcement learning, and sub-goals are the core element from which temporal abstractions are formed. At present, most hierarchical reinforcement learning methods require the sub-goals, or the number of sub-goals, to be specified manually. In many cases this not only demands considerable human intervention, but the chosen settings may also be unsuitable for the corresponding scenario, a problem that is especially pronounced when the dynamic environment is unknown. To address this, an Option-Critic algorithm based on Sub-goal Quantity Optimization (OC-SQO) is proposed. It adds an exploration phase in which the agent, through simple interaction with the environment, obtains an initial estimate of the number of sub-goals suited to the application scenario and identifies sub-goals on that basis. Policy gradients are then used to generate the corresponding abstractions, each represented as a triple of initiation state, intra-option policy, and termination function; training is carried out on this representation, the current state is updated according to the abstractions obtained from interaction, and the process is iteratively optimized. OC-SQO can start execution from any state and does not require sub-goals or parameters to be specified in advance; during execution, policy gradients generate the intra-option policies, the policy over abstractions, and the termination functions, so neither an intrinsic reward signal nor knowledge of the sub-goals needs to be provided, minimizing human intervention. Experiments verify the effectiveness of the algorithm.
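As a rough illustration of the triple representation mentioned above (initiation state, intra-option policy, termination function) and of an exploration-based estimate of the sub-goal quantity, the sketch below uses a tabular toy setting. The state/action sizes, the scoring rule inside estimate_subgoal_count, and all names are assumptions for illustration, not the OC-SQO procedure itself.

```python
# Toy sketch (assumed tabular setting) of an option triple and a crude
# sub-goal-quantity estimate obtained from exploration visit counts.
import numpy as np
from dataclasses import dataclass


@dataclass
class Option:
    init_state: int               # initiation state of the abstraction
    policy_params: np.ndarray     # per-state action logits (intra-option policy)
    beta_params: np.ndarray       # per-state termination logits

    def act(self, s, rng):
        p = np.exp(self.policy_params[s] - self.policy_params[s].max())
        p /= p.sum()
        return int(rng.choice(len(p), p=p))

    def terminate(self, s, rng):
        beta = 1.0 / (1.0 + np.exp(-self.beta_params[s]))   # sigmoid termination prob.
        return rng.random() < beta


def estimate_subgoal_count(visit_counts, threshold=0.5):
    """Toy stand-in for the exploration phase: count frequently visited,
    bottleneck-like states as the initial sub-goal quantity estimate."""
    freq = visit_counts / max(visit_counts.sum(), 1)
    return int((freq >= threshold * freq.max()).sum())


rng = np.random.default_rng(0)
n_states, n_actions = 6, 4
num_options = estimate_subgoal_count(np.array([1, 9, 2, 8, 1, 1]))   # -> 2 here
options = [Option(init_state=0,
                  policy_params=rng.normal(size=(n_states, n_actions)),
                  beta_params=rng.normal(size=n_states))
           for _ in range(num_options)]
a = options[0].act(0, rng)            # intra-option action
done = options[0].terminate(0, rng)   # termination decision
```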
Funding: supported by the National Basic Research Program of China (2013CB329603), the National Natural Science Foundation of China (61375058, 71231002), the China Mobile Research Fund (MCM 20130351), the Ministry of Education of China and the Special Co-Construction Project of Beijing Municipal Commission of Education.
Abstract: Option is a promising method for discovering the hierarchical structure in reinforcement learning (RL) and accelerating learning. The key to option discovery is how an agent can autonomously find useful subgoals among the trails it passes through. By analyzing the agent's actions in the trails, useful heuristics can be found: not only does the agent pass through subgoals more frequently, but its effective actions at subgoals are also restricted. Consequently, the subgoals can be regarded as the best-matching action-restricted states along the paths. In the grid-world environment, the concept of the unique-direction value (UDV), which reflects this action-restricted property, is introduced to find the best-matching action-restricted states. The UDV approach is used to form options autonomously, both offline and online. Experiments show that the approach finds subgoals correctly, and thus Q-learning with options found by both the offline and online processes can accelerate learning significantly.
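The sketch below is a toy illustration of the heuristic behind this abstract, not the paper's exact UDV definition: in a grid world, a doorway-like subgoal is both visited by many trails and passed with essentially one effective action, so a state whose action histogram is concentrated on a single direction and which is visited often scores highly. The scoring rule and the example trails are assumptions.

```python
# Toy scoring of action-restricted, frequently visited states (assumed rule,
# not the paper's UDV formula).
from collections import Counter, defaultdict


def subgoal_scores(trails):
    """trails: list of [(state, action), ...] from successful episodes.
    Score = visit fraction * concentration of the most common action."""
    actions_at = defaultdict(Counter)
    for trail in trails:
        for state, action in trail:
            actions_at[state][action] += 1
    total = sum(sum(c.values()) for c in actions_at.values())
    return {s: (sum(c.values()) / total) * (max(c.values()) / sum(c.values()))
            for s, c in actions_at.items()}


# Two toy trails through a two-room grid world sharing the doorway cell (1, 2).
trails = [
    [((1, 1), "right"), ((1, 2), "right"), ((1, 3), "down")],
    [((1, 1), "down"), ((2, 1), "right"), ((2, 2), "up"),
     ((1, 2), "right"), ((1, 3), "right")],
]
scores = subgoal_scores(trails)
best = max(scores, key=scores.get)   # the doorway cell (1, 2) scores highest here
```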
Abstract: Option pricing has become an important part of the financial market. As the market is always dynamic, it is difficult to predict the option price accurately. For this reason, various machine learning techniques have been designed and developed to predict the future trend of the option price. In this paper, we compare the effectiveness of Support Vector Machine (SVM) and Artificial Neural Network (ANN) models for the prediction of option prices. Both models are tested with a publicly available benchmark dataset, the SPY option prices for 2015, in both the training and testing phases. Data transformed by Principal Component Analysis (PCA) are used in both models to achieve better prediction accuracy. The entire dataset is partitioned into a training set (70%) and a test set (30%) to avoid overfitting. The outcomes of the SVM model are compared with those of the ANN model based on the root mean square error (RMSE). The experimental results demonstrate that the ANN model performs better than the SVM model and that the predicted option prices are in good agreement with the corresponding actual option prices.
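A minimal sketch of the comparison pipeline described above (PCA-transformed features, a 70/30 split, and SVM versus ANN regressors scored by RMSE) might look as follows; the synthetic data and hyperparameters are placeholders, not the paper's actual SPY 2015 dataset or settings.

```python
# Sketch of the SVM-vs-ANN option-price comparison with PCA features and RMSE
# scoring; the data and model settings are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                    # placeholder option features
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)     # placeholder option prices

X_pca = PCA(n_components=5).fit_transform(X)                      # PCA-transformed inputs
X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, test_size=0.3, random_state=0)

for name, model in [("SVM", SVR(kernel="rbf")),
                    ("ANN", MLPRegressor(hidden_layer_sizes=(32, 32),
                                         max_iter=2000, random_state=0))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name} RMSE: {rmse:.4f}")
```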
Funding: supported by the National Natural Science Foundation of China (61303108), the Suzhou Key Industries Technological Innovation-Prospective Applied Research Project (SYG201804), a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Fundamental Research Funds for the Central Universities, JLU (93K172020K25).
Abstract: In reinforcement learning, an agent may explore ineffectively when dealing with sparse-reward tasks in which finding a reward point is difficult. To solve this problem, we propose an algorithm called hierarchical deep reinforcement learning with automatic sub-goal identification via computer vision (HADS), which takes advantage of hierarchical reinforcement learning to alleviate the sparse-reward problem and improves the efficiency of exploration through a sub-goal mechanism. HADS uses a computer vision method to identify sub-goals automatically for hierarchical deep reinforcement learning. Because not all sub-goal points are reachable, a mechanism is proposed to remove unreachable sub-goal points and further improve the performance of the algorithm. HADS uses contour recognition to identify sub-goals from the state image: salient states in the state image may be recognized as sub-goals, while those that are not suitable are removed based on prior knowledge. Our experiments verify the effectiveness of the algorithm.
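A hedged sketch of a contour-based sub-goal identification step might look as follows, assuming OpenCV (version 4 or later) and a grayscale state image; the threshold, minimum area, and reachability predicate are illustrative assumptions rather than the exact HADS procedure.

```python
# Sketch of contour-based sub-goal candidates plus a reachability filter
# (assumed thresholds and predicate; not the exact HADS pipeline).
import cv2
import numpy as np


def find_subgoal_candidates(state_image, min_area=10.0):
    """Return centroids of salient contours in the state image as sub-goal candidates."""
    _, binary = cv2.threshold(state_image, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    candidates = []
    for contour in contours:
        m = cv2.moments(contour)
        if m["m00"] >= min_area:          # keep only salient (large enough) regions
            candidates.append((int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])))
    return candidates


def filter_reachable(candidates, is_reachable):
    """Drop sub-goal points that prior knowledge marks as unreachable."""
    return [pt for pt in candidates if is_reachable(pt)]


state = np.zeros((84, 84), dtype=np.uint8)
cv2.rectangle(state, (20, 20), (30, 30), 255, -1)   # a toy salient region in the state image
subgoals = filter_reachable(find_subgoal_candidates(state), lambda pt: True)
```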