
# Gradient Temporal-Difference Learning with Regularized Corrections

ICML 2020, pp. 3524-3534



Abstract

It is still common to use Q-learning and temporal difference (TD) learning, even though they have divergence issues and sound Gradient TD alternatives exist, because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously…


Introduction

- Off-policy learning—the ability to learn the policy or value function for one policy while following another—underlies many practical implementations of reinforcement learning.
- Many systems use experience replay, where the value function is updated using previous experiences under many different policies.
- One of the most widely-used algorithms, Q-learning—a temporal difference (TD) algorithm—is off-policy by design: updating toward the maximum value action in the current state, regardless of which action the agent selected.
- Based on the agent’s action A_t and the transition dynamics P : S × A × S → [0, 1], the environment transitions into a new state S_{t+1} and emits a scalar reward R_{t+1}.
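As a concrete anchor for the setup above, the classic linear TD(0) update, which this paper builds on, can be sketched as follows. This is a minimal illustration with our own variable names and defaults, not code from the paper:

```python
import numpy as np

def td0_update(w, x, r, x_next, alpha=0.1, gamma=0.99):
    """One linear TD(0) step on transition (x, r, x_next):
    w <- w + alpha * delta * x, where delta is the TD error."""
    delta = r + gamma * (w @ x_next) - (w @ x)
    return w + alpha * delta * x

# Toy check: repeatedly updating on one fixed transition drives the
# TD error for that transition toward zero.
w = np.zeros(3)
x, x_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
for _ in range(500):
    w = td0_update(w, x, 1.0, x_next)
# w[0] now approximates r + gamma * v(x_next) = 1.0
```

Q-learning replaces the bootstrap term with a max over next-state action values, which is what makes it off-policy by design.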

Highlights

- Off-policy learning—the ability to learn the policy or value function for one policy while following another—underlies many practical implementations of reinforcement learning
- In this paper we introduce a new Gradient temporal difference method, called temporal difference with Regularized Corrections (TDRC)
- We demonstrate that TD with Corrections (TDC) frequently outperforms the saddlepoint variant of Gradient temporal difference, which motivates building on TDC and shows the utility of being able to shift between TD and TDC by setting the regularization parameter
- We introduced a simple modification of the TD with Corrections algorithm that achieves performance much closer to that of temporal difference
- TD with Regularized Corrections is built on TD with Corrections, and, as we prove, inherits its soundness guarantees
- With extensions to non-linear function approximation, we find that the resulting algorithm, QRC, performs as well as Q-learning and in some cases notably better
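The ability to shift between TD and TDC via the regularization parameter can be made concrete. The sketch below renders the linear TDRC update as we read it from the paper: the TDC update plus an l2 penalty of strength beta on the secondary weight vector h, so that beta = 0 recovers TDC while a large beta drives h toward zero and recovers plain TD. The function name and default values are our own choices:

```python
import numpy as np

def tdrc_update(w, h, x, r, x_next, alpha=0.01, gamma=0.99, beta=1.0):
    """One linear TDRC step. w: value weights, h: secondary (correction)
    weights. The only difference from TDC is the -alpha*beta*h term."""
    delta = r + gamma * (w @ x_next) - (w @ x)
    w_new = w + alpha * delta * x - alpha * gamma * (h @ x) * x_next
    h_new = h + alpha * (delta - h @ x) * x - alpha * beta * h
    return w_new, h_new

# One step from zero weights on a toy transition.
w, h = np.zeros(2), np.zeros(2)
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w, h = tdrc_update(w, h, x, 1.0, x_next)
```

A practical point the paper emphasizes is that TDRC needs no separate stepsize sweep for the secondary weights: h uses the same alpha, with beta fixed at 1 across experiments.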

Methods

**Experiments in the Prediction Setting**

- The authors first establish the performance of TDRC across several small linear prediction tasks where they can carefully sweep hyper-parameters, analyze sensitivity, and average over many runs.
- The first problem, Boyan’s chain (Boyan, 2002), is a 13-state Markov chain where each state is represented by a compact feature representation.
- Like TD, TDRC was developed for prediction, under linear function approximation.
- Like TD, there are natural— though in some cases heuristic—extensions to the control setting and to non-linear function approximation.
- The authors first investigate TDRC in control with linear function approximation, where the extension is more straightforward.
- The authors show, for the first time, that gradient TD methods can outperform Q-learning when using neural networks, in two classic control domains and two visual games.
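To make the control extension concrete, here is a sketch of a QRC-style update with linear per-action values. The paper's QRC is defined for neural networks; this linear rendering, including the choice to apply the gradient correction to the greedy bootstrap action's weights, is our simplification and should not be read as the authors' exact algorithm:

```python
import numpy as np

def qrc_update(W, H, x, a, r, x_next, alpha=0.01, gamma=0.99, beta=1.0):
    """Sketch of one QRC-style step. W[a]: value weights for action a,
    H[a]: matching secondary weights. Bootstraps off the greedy next
    action, as Q-learning does."""
    q_next = W @ x_next                  # next-state value of each action
    a_next = int(np.argmax(q_next))      # greedy bootstrap action
    delta = r + gamma * q_next[a_next] - W[a] @ x
    W, H = W.copy(), H.copy()
    W[a] = W[a] + alpha * delta * x
    W[a_next] = W[a_next] - alpha * gamma * (H[a] @ x) * x_next
    H[a] = H[a] + alpha * (delta - H[a] @ x) * x - alpha * beta * H[a]
    return W, H

# One step from zero weights.
W, H = np.zeros((2, 2)), np.zeros((2, 2))
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W, H = qrc_update(W, H, x, a=0, r=1.0, x_next=x_next)
```

Setting beta = 0 in this sketch gives the uncorrected-regularization variant (QC-style), and zeroing the correction term entirely recovers linear Q-learning.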

Results

- The authors introduced a simple modification of the TDC algorithm that achieves performance much closer to that of TD.

Conclusion

- The authors introduced a simple modification of the TDC algorithm that achieves performance much closer to that of TD.
- With extensions to non-linear function approximation, the authors find that the resulting algorithm, QRC, performs as well as Q-learning and in some cases notably better.
- This constitutes the first demonstration of Gradient-TD methods outperforming Q-learning, and suggests this simple modification to the standard Q-learning update—to give QRC—could provide a more general purpose algorithm

- Table 1: Average area under the RMSPBE learning curve for each problem using the Adagrad stepsize-selection algorithm. Bolded values highlight the lowest RMSPBE obtained for a given problem. TD, HTD, and VTrace all appear to converge very slowly with Adagrad: HTD still exhibits oscillating behavior, and TD and VTrace show significant bias in final performance. These values correspond to the bar graphs in Figure 1
- Table 2: Average area under the RMSPBE learning curve for each problem using the Adam stepsize-selection algorithm. Bolded values highlight the lowest RMSPBE obtained for a given problem. TD, HTD, and VTrace all appear able to converge while using Adam, though convergence is very slow and not monotonic. These values correspond to the bar graphs in Figure 8
- Table 3: Average area under the RMSPBE learning curve for each problem using a constant stepsize. Bolded values highlight the lowest RMSPBE obtained for a given problem. These values correspond to the bar graphs in Figure 13
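For readers reproducing these tables, the reported quantity can be sketched as follows. The RMSPBE below uses the standard linear-prediction formulation (symbols follow common usage, not necessarily the paper's notation), and "average area under the learning curve" is taken here to mean the mean error over recorded steps; that reading is our assumption:

```python
import numpy as np

def rmspbe(w, X, P_pi, d, R, gamma):
    """Root mean-squared projected Bellman error of linear values v = X @ w.
    X: features (states x dims), P_pi: target-policy transition matrix,
    d: state weighting, R: expected one-step rewards."""
    D = np.diag(d)
    v = X @ w
    be = R + gamma * (P_pi @ v) - v        # Bellman error at each state
    # project the Bellman error onto the span of X under the D-weighted norm
    proj = X @ np.linalg.solve(X.T @ D @ X, X.T @ D @ be)
    return float(np.sqrt(proj @ D @ proj))

def auc(errors):
    """Average area under a learning curve: mean error across steps."""
    return float(np.mean(errors))

# Tabular sanity check: with one-hot features the projection is the
# identity, so the RMSPBE equals the D-weighted RMS Bellman error.
err = rmspbe(np.array([1.0, 1.0]), np.eye(2),
             np.array([[0.0, 1.0], [1.0, 0.0]]),
             np.array([0.5, 0.5]), np.array([0.0, 0.0]), gamma=0.0)
```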

Funding

- This work was funded by NSERC and CIFAR, particularly through funding the Alberta Machine Intelligence Institute (Amii) and the CCAI Chair program
- The authors also gratefully acknowledge funding from JPMorgan Chase & Co. and Google DeepMind

References

- Glorot, X., Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 249-256).
- Hackman, L. (2012). Faster Gradient TD Algorithms. M.Sc. thesis, University of Alberta, Edmonton.
- Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann, San Francisco.
- Barto, Andrew G., Richard S. Sutton, Charles W. Anderson. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, 5, 834-846.
- Bellemare, M. G., Naddaf, Y., Veness, J., Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
- Juditsky, A., Nemirovski, A. (2011). Optimization for Machine Learning.
- Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Liu B, Liu J, Ghavamzadeh M, Mahadevan S, Petrik M (2015). Finite-Sample Analysis of Proximal Gradient TD Algorithms. In International Conference on Uncertainty in Artificial Intelligence, pp. 504-513.
- Liu B, Liu J, Ghavamzadeh M, Mahadevan S, Petrik M (2016). Proximal Gradient Temporal Difference Learning Algorithms. In International Joint Conference on Artificial Intelligence, pp. 4195-4199.
- Borkar, V. S., Meyn, S.P. (2000). The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM J. Control and Optimization.
- Boyan, J.A. (2002). Technical Update: Least-Squares Temporal Difference Learning. Machine Learning.
- Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017) Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning.
- Duchi, J., Hazan, E., Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159.
- Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. and Legg, S. (2018) IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning.
- Feng, Y., Li, L., Liu, Q. (2019). A kernel loss for solving the bellman equation. In Advances in Neural Information Processing Systems (pp. 15430-15441).
- Mahadevan, S., Liu, B., Thomas, P., Dabney, W., Giguere, S., Jacek, N., Gemp, I., Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv:1405.6757.
- Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. arXiv:1702.03006.
- Maei, H. R. (2011). Gradient temporal-difference learning algorithms. Ph.D. thesis, University of Alberta, Edmonton.
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937).
- Munos, R., Stepleton, T., Harutyunyan, A., Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 1046–1054.
- Reddi, S. J., Kale, S., Kumar, S. (2019). On the convergence of adam and beyond. arXiv:1904.09237.
- Schaul, T., Quan, J., Antonoglou, I., Silver, D. (2016). Prioritized experience replay. In International Conference on Learning Representations.
- Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pp. 993–1000, ACM.
- Sutton, R. S., Mahmood, A. R., White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
- Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction, Second Edition. MIT Press.
- Touati, A., Bacon, P. L., Precup, D., Vincent, P. (2018). Convergent tree-backup and retrace with function approximation. arXiv:1705.09322.
- van Hasselt, H. P., Guez, A., Hessel, M., Mnih, V., Silver, D. (2016). Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems (pp. 4287-4295).
- van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., Modayil, J. (2018). Deep Reinforcement Learning and the Deadly Triad. arXiv:1812.02648
- White, A., White, M. (2016). Investigating Practical Linear Temporal Difference Learning. In International Conference on Autonomous Agents & Multiagent Systems.
- Young, K., Tian, T. (2019). MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments. arXiv:1903.03176.
- The proofs of convergence for many of the methods require independent samples for the updates. This condition is not generally met in the fully online learning setting that we consider throughout the rest of the paper. In Figure 7 we show results for all methods in the fully offline batch setting, demonstrating that—on the small problems that we consider—the conclusions do not change when transferring from the batch setting to the online setting. We include two additional methods in the batch setting, the Kernel Residual Gradient methods (Feng et al., 2019), which do not have a clear fully online implementation.
- The Residual Gradient (RG) family of algorithms provide an alternative gradient-based strategy for performing temporal difference learning. The RG methods minimize the Mean Squared Bellman Error (MSBE), while the gradient TD family of algorithms minimize a particular form of the MSBE, the Mean Squared Projected Bellman Error (MSPBE). The RG family of methods generally suffer from difficulty in obtaining independent samples from the environment, leading towards stochastic optimization algorithms which find a biased solution (Sutton & Barto, 2018). However, very recent work has generalized the MSBE and proposed an algorithmic strategy to perform unbiased stochastic updates (Feng et al., 2019). Because our results suggest that RG methods generally underperform the gradient TD family of methods, we choose to focus our extension on gradient TD methods for this paper.
- TDRC converges to the TD fixed point under very similar conditions as TDC (Maei, 2011); the key steps, including the G matrix for TDC++, are given in Maei (2011) and Appendix G.
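For reference on the Residual Gradient discussion above: in standard notation, with Bellman operator T, projection Π onto the span of the features, and norm weighted by the state distribution D, the two objective families differ only by the projection:

```latex
\mathrm{MSBE}(\mathbf{w}) = \left\lVert T v_{\mathbf{w}} - v_{\mathbf{w}} \right\rVert_{D}^{2},
\qquad
\mathrm{MSPBE}(\mathbf{w}) = \left\lVert \Pi \left( T v_{\mathbf{w}} - v_{\mathbf{w}} \right) \right\rVert_{D}^{2}
```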
