Funding: Partially supported by the National Natural Science Foundation of China (61661025, 61661026) and the Hundred Youth Talents Training Program Foundation of Lanzhou Jiaotong University (152022).
Abstract: A network selection optimization algorithm based on the Markov decision process (MDP) is proposed so that mobile terminals can always connect to the best wireless network in a heterogeneous network environment. Considering the different types of service requirements, the MDP model and its reward function are constructed from the quality-of-service (QoS) attribute parameters of the mobile users, and the network attribute weights are calculated with the analytic hierarchy process (AHP). The network handoff decision condition is designed according to the different types of user services and the time-varying characteristics of the network, and the MDP model is solved with a combined genetic algorithm and simulated annealing (GA-SA) method, so that users can seamlessly switch to the network with the best long-term expected reward. Simulation results show that the proposed algorithm converges well and guarantees that users with different service types obtain satisfactory expected total rewards while requiring few network handoffs.
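As a rough illustration of the AHP weighting step mentioned above, the following sketch derives QoS attribute weights from a pairwise-comparison matrix via its principal eigenvector. The attribute names and comparison values are hypothetical and are not taken from the paper; the resulting weights would then enter the MDP reward function in some paper-specific way not reproduced here.

```python
import numpy as np

def ahp_weights(pairwise: np.ndarray) -> np.ndarray:
    """Derive attribute weights from an AHP pairwise-comparison matrix
    via its principal right eigenvector, normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = np.argmax(eigvals.real)
    w = np.abs(eigvecs[:, principal].real)
    return w / w.sum()

# Hypothetical comparisons over four QoS attributes, e.g.
# (bandwidth, delay, jitter, packet loss) for a streaming-type service.
A = np.array([
    [1.0,   3.0,   5.0,   7.0],
    [1/3.0, 1.0,   3.0,   5.0],
    [1/5.0, 1/3.0, 1.0,   3.0],
    [1/7.0, 1/5.0, 1/3.0, 1.0],
])
print(ahp_weights(A))  # weights that would score candidate networks in the MDP reward
```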
Funding: This work is supported by the National Key R&D Program of China under Grant No. 2020AAA0103801. Quanlin Li is supported by the National Natural Science Foundation of China under Grant Nos. 71671158 and 71932002 and by the Beijing Social Science Foundation Research Base Project under Grant No. 19JDGLA004. Xiaole Wu is supported by the National Natural Science Foundation of China under Grant No. 72025102.
Abstract: In this paper, we provide a new theoretical framework of pyramid Markov processes to solve some open and fundamental problems of blockchain selfish mining in a rigorous mathematical setting. We first describe a more general model of blockchain selfish mining with both a two-block leading competitive criterion and a new economic incentive mechanism. We then establish a pyramid Markov process and show that it is irreducible and positive recurrent, and that its stationary probability vector is matrix-geometric with an explicitly representable rate matrix. We also use the stationary probability vector to study the influence of orphan blocks on the waste of computing resources. Next, we set up a pyramid Markov reward process to investigate the long-run average mining profits of the honest and dishonest mining pools, respectively. As a by-product, we build one-dimensional Markov reward processes and provide new interpretations of the Markov chain and the revenue analysis reported in the seminal work by Eyal and Sirer (2014). The pyramid Markov (reward) processes open up a new avenue in the study of blockchain selfish mining, and we hope that the methodology and results developed in this paper shed light on this problem and enable a series of promising further studies.
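The abstract states that the stationary probability vector is matrix-geometric with an explicitly representable rate matrix. The paper's pyramid process and its generator blocks are not reproduced here; purely as a generic illustration of how such a solution is computed, the sketch below iterates the classical fixed-point equation for the rate matrix R of a level-independent quasi-birth-and-death (QBD) process, where A0, A1, A2 are assumed generator blocks for upward, local, and downward transitions.

```python
import numpy as np

def qbd_rate_matrix(A0, A1, A2, tol=1e-12, max_iter=100_000):
    """Minimal nonnegative solution R of A0 + R A1 + R^2 A2 = 0 for a
    level-independent QBD, via the classical fixed-point iteration
    R <- -(A0 + R^2 A2) A1^{-1} (Neuts' matrix-geometric method)."""
    A1_inv = np.linalg.inv(A1)
    R = np.zeros_like(np.asarray(A0, dtype=float))
    for _ in range(max_iter):
        R_next = -(A0 + R @ R @ A2) @ A1_inv
        if np.max(np.abs(R_next - R)) < tol:
            return R_next
        R = R_next
    raise RuntimeError("rate-matrix iteration did not converge")

# Given R, the stationary sub-vectors of the repeating levels satisfy
# pi_k = pi_1 @ np.linalg.matrix_power(R, k - 1), the matrix-geometric form.
```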
Funding: Project (No. 90820306) supported by the National Natural Science Foundation of China.
Abstract: Interest in inverse reinforcement learning (IRL), the problem of recovering the reward function underlying a Markov decision process (MDP) given the dynamics of the system and the behavior of an expert, has recently increased. This paper deals with an incremental approach to online IRL. First, the convergence property of the incremental method for the IRL problem was investigated, and bounds on both the number of mistakes made during learning and the regret were established with a detailed proof. Then an online algorithm based on incremental error correction was derived for the IRL problem. The key idea is to add an increment to the current reward estimate each time an action mismatch occurs, which drives the estimate toward a target optimal value. The proposed method was tested in a driving simulation experiment and was able to efficiently recover an adequate reward function.
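The abstract describes the error-correcting update only at a high level. The sketch below is a hypothetical, perceptron-style rendering of that step, not the paper's exact rule: the feature map `features`, the action-value function `q_value`, and the step size `eta` are all placeholders. Whenever the action that is greedy under the current reward estimate disagrees with the expert's action, the estimate is nudged toward the expert's feature direction.

```python
import numpy as np

def incremental_irl_step(w, state, expert_action, actions, features, q_value, eta=0.1):
    """One error-correcting update of a linear reward estimate w.

    features(state, action) -> np.ndarray : reward feature vector (placeholder)
    q_value(w, state, action) -> float    : action value under reward estimate w (placeholder)
    """
    greedy = max(actions, key=lambda a: q_value(w, state, a))
    if greedy != expert_action:  # action mismatch: nudge w toward the expert's choice
        w = w + eta * (features(state, expert_action) - features(state, greedy))
    return w
```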
Abstract: In this work, a numerical estimate of the stability index is made for a controlled consumption-investment process under the discounted-reward optimization criterion. Using explicit formulas for the optimal stationary policies and for the value functions, the stability index is calculated explicitly, and its asymptotic behavior as the discount coefficient approaches 1 is investigated with statistical techniques applied to numerical experiments. The results obtained define the conditions under which an approximate optimal stationary policy can be used to control the original process.
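The paper relies on explicit formulas specific to its consumption-investment model, which are not reproduced in the abstract. Purely to illustrate what a stability-index estimate measures, the sketch below uses a synthetic finite MDP (all data randomly generated, with no connection to the paper's model) and reports the loss incurred when the true model is controlled with the policy that is optimal for a perturbed approximating model, for discount coefficients approaching 1.

```python
import numpy as np

def value_iteration(P, r, beta, tol=1e-10):
    """Discounted value iteration for a finite MDP; P[a] is the |S|x|S|
    transition matrix of action a and r[a] the corresponding reward vector."""
    n_a, n_s = r.shape
    V = np.zeros(n_s)
    while True:
        Q = np.array([r[a] + beta * P[a] @ V for a in range(n_a)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

def policy_value(P, r, beta, policy):
    """Exact discounted value of a stationary deterministic policy."""
    n_s = len(policy)
    P_pi = np.array([P[policy[s], s] for s in range(n_s)])
    r_pi = np.array([r[policy[s], s] for s in range(n_s)])
    return np.linalg.solve(np.eye(n_s) - beta * P_pi, r_pi)

# Toy "true" model (P, r) and a perturbed approximating model P_hat.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(2, 4))                     # 2 actions, 4 states
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(4), size=(2, 4))
r = rng.uniform(size=(2, 4))

for beta in (0.90, 0.99, 0.999):
    V_star, _ = value_iteration(P, r, beta)          # optimum of the true model
    _, pol_hat = value_iteration(P_hat, r, beta)     # policy optimal for the approximation
    stability_index = np.max(V_star - policy_value(P, r, beta, pol_hat))
    print(f"beta = {beta}: stability index ~ {stability_index:.4f}")
```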
Abstract: This paper considers the variance optimization problem for the average reward in a continuous-time Markov decision process (MDP). It is assumed that the state space is countable and the action space is a Borel measurable space. The main purpose of this paper is to find the policy with the minimal variance within the space of deterministic stationary policies. Unlike in the traditional Markov decision process, the cost function under the variance criterion is affected by future actions. To this end, we convert the variance minimization problem into a standard MDP by introducing a concept called the pseudo-variance. Further, by giving a policy iteration algorithm for the pseudo-variance optimization problem, the optimal policy of the original variance optimization problem is derived, and a sufficient condition for the variance-optimal policy is given. Finally, an example is used to illustrate the conclusions of this paper.
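The abstract does not spell out the pseudo-variance construction or the policy iteration details, and the paper works in continuous time with a Borel action space. As a loose, discrete-time, finite-state analogue of the idea for illustration only, the sketch below fixes the current policy's mean reward, treats the squared deviation from that fixed reference as a standard average-cost criterion, performs a policy-improvement step, and repeats; every modeling choice here is an assumption, not the paper's construction.

```python
import numpy as np

def gain_and_bias(P_pi, r_pi):
    """Average reward (gain) and bias of a stationary policy in a unichain
    finite MDP: solve g + h = r_pi + P_pi h with the normalization h[0] = 0."""
    n = len(r_pi)
    A = np.zeros((n + 1, n + 1))
    A[:n, 0] = 1.0               # gain column
    A[:n, 1:] = np.eye(n) - P_pi
    A[n, 1] = 1.0                # normalization: h[0] = 0
    b = np.append(r_pi, 0.0)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return x[0], x[1:]

def pseudo_variance_policy_iteration(P, r, max_iters=50):
    """Fix the current policy's mean reward eta, minimize the average of the
    pseudo-variance cost (r - eta)^2 via one policy-improvement step, repeat."""
    n_a, n_s = r.shape
    policy = np.zeros(n_s, dtype=int)
    for _ in range(max_iters):
        P_pi = np.array([P[policy[s], s] for s in range(n_s)])
        r_pi = np.array([r[policy[s], s] for s in range(n_s)])
        eta, _ = gain_and_bias(P_pi, r_pi)           # reference mean reward
        c = (r - eta) ** 2                           # pseudo-variance cost
        c_pi = np.array([c[policy[s], s] for s in range(n_s)])
        _, h = gain_and_bias(P_pi, -c_pi)            # bias for the (negated) cost
        Q = np.array([-c[a] + P[a] @ h for a in range(n_a)])
        new_policy = Q.argmax(axis=0)                # improvement step
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy
```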