Reinforcement learning or active inference?

Friston KJ, Daunizeau J, Kiebel SJ - PLoS ONE (2009)

Bottom Line: This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

Affiliation: The Wellcome Trust Centre for Neuroimaging, University College London, London, United Kingdom. k.friston@fil.ion.ucl.ac.uk

ABSTRACT
This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.
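
For orientation, the mountain-car benchmark can be stated compactly. The sketch below uses the conventional discrete-time parameterisation from the reinforcement-learning literature; the bounds, constants and goal position are those conventions, included only to make the problem concrete, and are not the authors' continuous-state formulation.

import math

MIN_POS, MAX_POS = -1.2, 0.6   # conventional position bounds (assumed, for illustration)
MAX_SPEED = 0.07
GOAL_POS = 0.5                 # top of the right-hand hill in the classic benchmark

def step(position, velocity, action):
    # One update of the under-powered car; action is a thrust in {-1, 0, +1}.
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0.0:
        velocity = 0.0         # inelastic collision with the left-hand wall
    return position, velocity, position >= GOAL_POS

The engine is too weak to drive straight up the slope, so any successful policy must first reverse up the opposite hill to gain momentum; this is the behaviour the active-inference agent reproduces without an explicit reward signal.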

pone-0006421-g003: Equilibria in the state-space of the mountain-car problem. Left panels: Flow fields and associated equilibria for an uncontrolled environment (top), a controlled or optimised environment (middle) and under prior expectations after learning (bottom). Notice how the flow of states in the controlled environment enforces trajectories that start by moving away from the desired location (green dot at x = 1). The arrows denote the flow of states (position and velocity) prescribed by the parameters. The equilibrium density in each row is the principal eigenfunction of the Fokker-Planck operator associated with the parameters. For the controlled and expected environments, these are low-entropy equilibria, centred on the desired location. Right panels: These panels show the flow fields in terms of their nullclines, the lines in state-space where the rate of change of one variable is zero. Here the nullcline for position lies along the x-axis, where velocity is zero. The nullcline for velocity is where the change in velocity goes from positive (grey) to negative (white). Fixed points correspond to the intersections of these nullclines. Under the uncontrolled environment (top) there is a stable fixed point, where the velocity nullcline intersects the position nullcline with negative slope. Under controlled (middle) and expected (bottom) dynamics there are three fixed points. The rightmost fixed point lies under the desired equilibrium density and is stable. The middle fixed point sits halfway up the hill and the final fixed point lies at the bottom; both are unstable and repel trajectories, so that trajectories are ultimately attracted to the desired location. The red lines depict exemplar trajectories under deterministic flow. In the controlled environment, these show the optimum behaviour of moving up the opposite hill to gain momentum so that the desired location can be reached.
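
The nullcline and fixed-point analysis in the right-hand panels can be reproduced numerically for any candidate flow. The sketch below uses an assumed stand-in for the uncontrolled dynamics (gravity along a slope proportional to cos(3x), plus light friction); the coefficients are illustrative, not the values used in the paper. With this flow one recovers a stable fixed point at the valley bottom and an unstable one at the crest, consistent with the top row of the figure.

import numpy as np
from scipy.optimize import brentq

def accel(x, v):
    # Assumed stand-in for the uncontrolled acceleration: gravity along a slope
    # proportional to cos(3x), plus light friction (coefficients are illustrative).
    return -2.5 * np.cos(3.0 * x) - 0.25 * v

# Position nullcline: v = 0 (the x-axis, as in the figure).
# Velocity nullcline: accel(x, v) = 0.
# Fixed points lie on v = 0 where accel(x, 0) changes sign.
xs = np.linspace(-1.2, 0.6, 400)
a0 = accel(xs, 0.0)
fixed_points = [brentq(lambda x: accel(x, 0.0), xs[i], xs[i + 1])
                for i in range(len(xs) - 1) if a0[i] * a0[i + 1] < 0]

for xf in fixed_points:
    # Jacobian of (xdot, vdot) = (v, accel) evaluated at the fixed point (xf, 0).
    J = np.array([[0.0, 1.0],
                  [7.5 * np.sin(3.0 * xf), -0.25]])   # d(accel)/dx and d(accel)/dv
    stable = np.all(np.linalg.eigvals(J).real < 0.0)
    print(f"fixed point at x = {xf:+.3f}: {'stable' if stable else 'unstable'}")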

Mentions: The upper panels of Figure 3 show the equilibrium densities without control (top row) and for the controlled environment that approximates our desired equilibrium (middle row). Here, the desired equilibrium was a Gaussian density centred on x = 1 with small standard deviations. We have now created an environment in which the desired location attracts all trajectories. As anticipated, the trajectories in Figure 3 (middle row) move away from the desired location initially and then converge on it. This controlled environment now plays host to a naïve agent, who must learn its dynamics through experience.
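
The idea of prescribing an equilibrium density, rather than a reward function, can be illustrated in one dimension. For a Gaussian p(x) centred on x = 1, Langevin dynamics whose drift is (gamma/2) times the gradient of log p, with noise amplitude gamma, have p as their stationary (Fokker-Planck) density; simulating them therefore recovers the prescribed Gaussian. This is a generic construction used as a stand-in for the paper's controlled flow, and the values of gamma, sigma and the step size below are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.25            # desired equilibrium: Gaussian centred on x = 1 (sigma assumed)
gamma, dt, n_steps = 0.5, 1e-2, 200_000   # noise amplitude and integration step (assumed)

x = 0.0                          # start away from the desired location
samples = np.empty(n_steps)
for t in range(n_steps):
    drift = -(gamma / 2.0) * (x - mu) / sigma**2   # (gamma/2) * d/dx log p(x)
    x += drift * dt + np.sqrt(gamma * dt) * rng.standard_normal()
    samples[t] = x

# After a burn-in, the empirical mean and spread approach mu and sigma,
# confirming that the prescribed Gaussian is the equilibrium density of this flow.
print(samples[n_steps // 2:].mean(), samples[n_steps // 2:].std())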

