Dopamine, uncertainty and TD learning.

Niv Y, Duff MO, Dayan P - Behav Brain Funct (2005)

Bottom Line: Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards which bring about persistent prediction errors.

Affiliation: Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel. yael@gatsby.ucl.ac.uk

ABSTRACT
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards, which bring about persistent prediction errors. In particular, we show that when the non-stationary prediction errors are averaged across trials, a ramp should be apparent in the activity of the dopamine neurons, whose magnitude depends on the learning rate. Exactly this phenomenon was observed in a recent experiment, although it was interpreted there in antipodal terms, as a within-trial encoding of uncertainty.
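The averaging argument can be made concrete with a short simulation. The sketch below is our illustration of the mechanism, not the authors' code: it assumes a tapped-delay-line stimulus representation, TD(0) with gamma = 1, a 50% reward probability, and a six-fold compression of below-baseline responses standing in for the low dopamine baseline; all of these numbers are illustrative.

```python
import numpy as np

# A minimal sketch of the averaging argument, not the authors' code.
# Assumptions: tapped-delay-line stimulus representation, TD(0) with
# gamma = 1, reward probability 0.5, and below-baseline dips compressed
# six-fold to mimic the low dopamine baseline. All numbers illustrative.

rng = np.random.default_rng(0)

T = 10           # timesteps from stimulus onset to the time of reward
p_reward = 0.5   # probability that a trial is rewarded
alpha = 0.1      # learning rate; the predicted ramp scales with this
scale_neg = 6.0  # asymmetry factor for negative prediction errors

V = np.zeros(T + 1)              # V[T] is the terminal state, stays 0
n_trials = 20000
coded = np.zeros((n_trials, T))  # asymmetrically coded TD error

for trial in range(n_trials):
    r = np.zeros(T)
    if rng.random() < p_reward:
        r[T - 1] = 1.0                  # reward at stimulus offset
    for t in range(T):
        delta = r[t] + V[t + 1] - V[t]  # TD(0) prediction error
        V[t] += alpha * delta           # learning never converges here:
                                        # per-trial errors persist
        # low baseline: dips below baseline have a compressed range
        coded[trial, t] = delta if delta >= 0 else delta / scale_neg

# A symmetric code would average to ~0 during the delay; the asymmetric
# code leaves a positive ramp that grows toward the time of reward.
print(np.round(coded[n_trials // 2:].mean(axis=0), 4))
```

Increasing alpha enlarges the trial-to-trial fluctuations of the learned values, and with them the averaged ramp, which is the learning-rate dependence referred to above.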


Figure 3: Trace conditioning with probabilistic rewards. (a) An illustration of one trial of the delay conditioning task of Fiorillo et al. [15]. A trial consists of a 2-second visual stimulus, the offset of which coincides with the delivery of the juice reward, if such a reward is programmed according to the probability associated with the visual cue. In unrewarded trials the stimulus terminates without a reward. In both cases, trials are separated by an inter-trial interval of 9 seconds on average. (b) An illustration of one trial of the trace conditioning task of Morris et al. [16]. The crucial difference is that there is now a substantial temporal delay between the offset of the stimulus and the onset of the reward (the "trace" period), and no external stimulus indicates the expected time of reward. This confers additional uncertainty, as the precise timing of the predicted reward must be resolved internally, especially in unrewarded trials. In this task, as in [15], one of several visual stimuli (not shown) was presented in each trial, and each stimulus was associated with a probability of reward. Here, too, the monkey was required to perform an instrumental response (pressing the key corresponding to the side on which the stimulus was presented); failure to respond terminated the trial without a reward. Trials were separated by variable inter-trial intervals. (c,d) DA firing rate (smoothed) relative to baseline, around the expected time of the reward, in rewarded trials (c) and in unrewarded trials (d). (c,d) Reprinted from [16] ©2004 with permission from Elsevier. The traces imply an overall positive response at the expected time of the reward, preceded by at most a very small ramp. Similar results were obtained in a classical conditioning task briefly described in [15], which employed a trace conditioning procedure, confirming that the trace period, and not the instrumental nature of the task depicted in (b), was the crucial difference from (a).
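For concreteness, the two trial designs can be sketched as event timelines. The snippet below is a hypothetical illustration based only on the timings given in the caption; the trace-period duration, the exponential inter-trial intervals, and the function names are assumptions, and the instrumental key-press of [16] is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def delay_trial(p_reward):
    """Delay conditioning as in Fiorillo et al. [15]: a 2 s stimulus
    whose offset coincides with reward delivery (if rewarded)."""
    rewarded = rng.random() < p_reward
    return {"stimulus_on": 0.0, "stimulus_off": 2.0,
            "reward_time": 2.0 if rewarded else None,
            "iti": rng.exponential(9.0)}  # ~9 s mean inter-trial interval

def trace_trial(p_reward, trace=2.0):
    """Trace conditioning as in Morris et al. [16]: an unsignalled gap
    (duration assumed here) separates stimulus offset from reward."""
    rewarded = rng.random() < p_reward
    return {"stimulus_on": 0.0, "stimulus_off": 2.0,
            "reward_time": 2.0 + trace if rewarded else None,
            "iti": rng.exponential(9.0)}  # variable ITI, distribution assumed

print(delay_trial(0.5))
print(trace_trial(0.5))
```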

Mentions: By contrast, at the time of potential reward delivery, TD theory predicts that on average there should be no activity, as, on average, there is no prediction error at that time. Of course, in the probabilistic reinforcement design (at least for p_r ≠ 0, 1) there is in fact a prediction error at the time of delivery or non-delivery of reward on every single trial. On trials in which a reward is delivered, the prediction error should be positive (as the reward obtained is larger than the average reward expected). Conversely, on trials with no reward it should be negative (see Figure 1c). Crucially, under TD, the average of these differences, weighted by their probabilities of occurring, should be zero. If it is not zero, then this prediction error should act as a plasticity signal, changing the predictions until there is no prediction error. At variance with this expectation, the data in Figure 1a, which are averaged over both rewarded and unrewarded trials, show that there is in fact positive mean activity at this time. This is also evident in the data of Morris et al. [16] (see Figure 3c). The positive DA responses show no signs of disappearing even with substantial training (over the course of months).
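The zero-mean claim and its breakdown under asymmetric coding amount to one line of arithmetic; a worked example (with an assumed six-fold compression of negative errors, matching the sketch above) makes it explicit:

```python
p_r = 0.5                  # reward probability
V = p_r                    # converged prediction at the time of reward
delta_rew = 1.0 - V        # +0.5 prediction error on rewarded trials
delta_unrew = 0.0 - V      # -0.5 prediction error on unrewarded trials

# Under TD, the probability-weighted average error vanishes, so the
# predictions are stable and no further learning is driven:
print(p_r * delta_rew + (1 - p_r) * delta_unrew)        # 0.0

# If below-baseline dips are compressed (factor of 6 assumed), the
# measured average response at the time of reward is positive:
print(p_r * delta_rew + (1 - p_r) * delta_unrew / 6.0)  # ~0.208
```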

