Dopamine, uncertainty and TD learning.

Niv Y, Duff MO, Dayan P - Behav Brain Funct (2005)

Bottom Line: Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards which bring about persistent prediction errors.

Affiliation: Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel. yael@gatsby.ucl.ac.uk

ABSTRACT
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards which bring about persistent prediction errors. In particular, we show that when the non-stationary prediction errors are averaged across trials, a ramping in the activity of the dopamine neurons should be apparent, whose magnitude depends on the learning rate. This exact phenomenon was observed in a recent experiment, though it was interpreted there, in antipodal terms, as a within-trial encoding of uncertainty.
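To make the asymmetry concrete, here is a minimal numeric sketch (our illustration, not from the paper): with reward probability pr = 0.5 and a fully learned prediction of 0.5, the true TD error at the time of reward averages to zero across trials; but if negative errors are compressed by a factor d before averaging, mimicking the limited room below the low dopamine baseline, the trial-averaged "measured" error is positive. The value d = 1/6 is the scaling used in Figure 1; everything else here is illustrative.

```python
# Minimal sketch (our illustration): the true TD error at reward time
# averages to zero, but compressing negative errors by d (mimicking the
# low dopamine baseline) leaves a positive trial average.
import numpy as np

rng = np.random.default_rng(0)
p, d = 0.5, 1 / 6                        # reward probability; asymmetry factor
delta = rng.binomial(1, p, 100_000) - p  # reward-time TD error: +0.5 or -0.5
measured = np.where(delta >= 0, delta, d * delta)  # compress negative errors

print(delta.mean())     # ~0.00 -> symmetric average is zero
print(measured.mean())  # ~0.21 -> asymmetric average is positive
```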

Figure 1: Averaged prediction errors in a probabilistic reward task. (a) DA response in trials with different reward probabilities. Population peri-stimulus time histograms (PSTHs) show the summed spiking activity of several DA neurons over many trials, for each pr, pooled over rewarded and unrewarded trials at intermediate probabilities. (b) TD prediction error with asymmetric scaling. In the simulated task, in each trial one of five stimuli was randomly chosen and displayed at time t = 5. The stimulus was turned off at t = 25, at which time a reward was given with probability pr, as specified by the stimulus. We used a tapped delay-line representation of the stimuli (see text), with each stimulus represented by a different set of units ('neurons'). The TD error was δ(t) = r(t) + w(t - 1)·x(t) - w(t - 1)·x(t - 1), with r(t) the reward at time t, and x(t) and w(t) the state and weight vectors for the unit. A standard online TD learning rule was used with a fixed learning rate α, w(t) = w(t - 1) + αδ(t)x(t - 1), so each weight represented an expected future reward value. As in Fiorillo et al. [15], we depict the prediction error δ(t) averaged over many trials, after the task has been learned. The representational asymmetry arises because negative values of δ(t) have been scaled by d = 1/6 prior to summation of the simulated PSTH, although learning proceeds according to the unscaled errors. Finally, to account for the small positive responses at the time of the stimulus for pr = 0 and at the time of the (predicted) reward for pr = 1 seen in (a), we assumed a small (8%) chance that a predictive stimulus is misidentified. (c) DA response in pr = 0.5 trials, separated into rewarded (left) and unrewarded (right) trials. (d) TD model of (c). (a,c) Reprinted with permission from [15] © 2003 AAAS. Permission from AAAS is required for all other uses.
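The sketch below is our reconstruction of the simulation the caption describes; the stimulus and reward times, the five reward probabilities, d = 1/6, and the 8% misidentification chance come from the caption, while the trial length, trial counts, and the learning rate α = 0.05 are assumptions. It reproduces the qualitative features of (b): a cue response that grows with pr, a ramp toward the time of reward at intermediate probabilities, and small residual responses at pr = 0 and pr = 1 due to misidentification.

```python
# Sketch of the Figure 1b simulation (our reconstruction; alpha, trial length
# and trial counts are assumed). One tapped delay-line per stimulus, online
# TD(0) learning, and compression of negative errors only in the PSTH.
import numpy as np

rng = np.random.default_rng(0)
T = 30                               # time steps per trial (assumed)
T_STIM, T_REW = 5, 25                # stimulus onset / reward time (caption)
PROBS = [0.0, 0.25, 0.5, 0.75, 1.0]  # reward probability per stimulus
ALPHA, D, MISID = 0.05, 1 / 6, 0.08  # alpha assumed; d, 8% misid from caption

w = np.zeros((len(PROBS), T_REW - T_STIM))  # one delay-line per stimulus

def run_trial(s_true):
    """One trial of stimulus s_true; returns the per-timestep TD error."""
    # 8% chance the cue is misidentified: predictions then follow the wrong
    # delay-line, while the reward still follows the true stimulus.
    s = s_true if rng.random() >= MISID else rng.integers(len(PROBS))
    rewarded = rng.random() < PROBS[s_true]
    delta = np.zeros(T)
    for t in range(1, T):
        # tapped delay-line: unit (t - T_STIM) is active while the cue is on
        v_now = w[s, t - T_STIM] if T_STIM <= t < T_REW else 0.0
        v_prev = w[s, t - 1 - T_STIM] if T_STIM <= t - 1 < T_REW else 0.0
        r = 1.0 if (t == T_REW and rewarded) else 0.0
        delta[t] = r + v_now - v_prev                 # TD error
        if T_STIM <= t - 1 < T_REW:                   # online TD(0) update
            w[s, t - 1 - T_STIM] += ALPHA * delta[t]
    return delta

for _ in range(5000):                                 # learning phase
    run_trial(rng.integers(len(PROBS)))

# Measurement phase: learning stays on (the errors remain non-stationary),
# but negative errors are compressed by d before averaging into the PSTH.
n, psth = 2000, np.zeros((len(PROBS), T))
for _ in range(n):
    for s in range(len(PROBS)):
        delta = run_trial(s)
        psth[s] += np.where(delta >= 0, delta, D * delta)
psth /= n

for s, p in enumerate(PROBS):
    print(f"pr={p:.2f}: cue {psth[s, T_STIM]:+.2f}, "
          f"pre-reward ramp {psth[s, T_REW - 1]:+.2f}, "
          f"reward {psth[s, T_REW]:+.2f}")
```

Note that learning must remain on while the PSTH is accumulated: the ramp arises precisely from averaging non-stationary errors, so its magnitude grows with α, and the printed cue responses also show the near-linear increase with pr discussed below.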

Mentions: Figure 1a shows population histograms of extracellularly-recorded DA cell activity, for each pr. TD theory predicts that the phasic activation of the DA cells at the time of the visual stimuli should correspond to the average expected reward, and so should increase with pr. Figure 1a shows exactly this – indeed, across the population, the increase is quite linear. Morris et al. [16] report a similar result in an instrumental (trace) conditioning task also involving probabilistic reinforcement.

