A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback.
Affiliation: Institute for Theoretical Computer Science, Graz University of Technology, Graz, Austria.
ABSTRACT
Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as a candidate for a learning rule that could explain how behaviorally relevant adaptive changes in complex networks of spiking neurons could be achieved in a self-organizing manner through local synaptic plasticity. However, the capabilities and limitations of this learning rule could so far only be tested through computer simulations. This article provides tools for an analytic treatment of reward-modulated STDP, which allows us to predict under which conditions reward-modulated STDP will achieve a desired learning effect. These analytical results imply that neurons can learn through reward-modulated STDP to classify not only spatial but also temporal firing patterns of presynaptic neurons. They can also learn to respond to specific presynaptic firing patterns with particular spike patterns. Finally, the resulting learning theory predicts that even difficult credit-assignment problems, where it is very hard to tell which synaptic weights should be modified in order to increase the global reward for the system, can be solved in a self-organizing manner through reward-modulated STDP. This yields an explanation for a fundamental experimental result on biofeedback in monkeys by Fetz and Baker. In this experiment, monkeys were rewarded for increasing the firing rate of a particular neuron in the cortex and were able to solve this extremely difficult credit-assignment problem. Our model for this experiment relies on a combination of reward-modulated STDP with variable spontaneous firing activity. Hence it also provides a possible functional explanation for trial-to-trial variability, which is characteristic of cortical networks of neurons but has no analogue in currently existing artificial computing systems.
In addition, our model demonstrates that reward-modulated STDP can be applied to all synapses in a large recurrent neural network without endangering the stability of the network dynamics.
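The learning rule described in the abstract can be illustrated with a minimal, hypothetical sketch: each synapse accumulates an eligibility trace driven by ordinary STDP spike pairings, and the trace is converted into a weight change only in proportion to the global reward signal d(t). The trace mechanism, window shape, and all parameter values below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stdp_window(dt, a_plus=1.0, a_minus=1.05, tau=20.0):
    """Illustrative STDP learning window W(dt), dt = t_post - t_pre in ms."""
    if dt >= 0:
        return a_plus * np.exp(-dt / tau)      # pre-before-post: potentiation
    return -a_minus * np.exp(dt / tau)         # post-before-pre: depression

def simulate(pre_spikes, post_spikes, reward, T, dt_ms=1.0,
             tau_e=500.0, lr=0.01, w0=0.5):
    """Evolve one synaptic weight under eligibility-trace
    reward-modulated STDP; reward[t] plays the role of d(t)."""
    w, e = w0, 0.0
    pre, post = set(pre_spikes), set(post_spikes)
    last_pre = last_post = None
    for t in range(T):
        e *= np.exp(-dt_ms / tau_e)            # eligibility trace decays
        if t in pre:
            last_pre = t
            if last_post is not None:
                e += stdp_window(last_post - t)
        if t in post:
            last_post = t
            if last_pre is not None:
                e += stdp_window(t - last_pre)
        w += lr * reward[t] * e                # weight moves only with reward
    return w
```

With causally paired spikes (post 5 ms after pre) and positive reward, the weight grows; with the pairing reversed, it shrinks, mirroring the sign behavior the theory analyzes.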
Mentions: We show that this phenomenon can in principle be explained by reward-modulated STDP. In order to do that, we define a model for the experiment which allows us to formulate an equation for the reward signal d(t). This enables us to calculate synaptic weight changes for this particular scenario. We consider as model a recurrent neural circuit where the spiking activity of one neuron k is recorded by the experimenter. (Experiments where two neurons are recorded and reinforced were also reported in [17]. We tested this case in computer simulations (see Figure 2) but did not treat it explicitly in our theoretical analysis.) We assume that in the monkey brain a reward signal d(t) is produced which depends on the visual feedback (through an illuminated meter, whose pointer deflection was dependent on the current firing rate of the randomly selected neuron k) as well as on previously received liquid rewards, and that this signal d(t) is delivered to all synapses in large areas of the brain. We can formalize this scenario by defining a reward signal which depends on the spike rate of the arbitrarily selected neuron k (see Figure 3A and 3B). More precisely, a reward pulse of shape εr(r) (the reward kernel) is produced with some delay dr every time the neuron k produces an action potential (Equation 9). Note that d(t) = h(t)−h̅ is defined in Equation 1 as a signal with zero mean. In order to satisfy this constraint, we assume that the reward kernel εr has zero mass, i.e., ∫ εr(r) dr = 0. For the analysis, we use the linear Poisson neuron model described in Methods. The mean weight change for synapses to the reinforced neuron k is then given approximately by Equation 10 (see Methods). This equation describes STDP with a learning rate proportional to the integral ∫ fc(r) εr(r) dr. The outcome of the learning session will strongly depend on this integral and thus on the form of the reward kernel εr.
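This construction of d(t) can be sketched concretely: build a zero-mass reward kernel εr and superimpose one delayed copy of it per spike of the reinforced neuron k. The kernel shape, delay, and amplitudes below are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

DT = 1.0  # ms time step (assumed discretization)

def reward_kernel(pos_len=300, neg_len=3000, amp=1.0):
    """Positive bump followed by a long negative tail, scaled so that
    the kernel has zero mass, as required for a zero-mean d(t)."""
    k = np.concatenate([amp * np.ones(pos_len),
                        -np.ones(neg_len) * amp * pos_len / neg_len])
    assert abs(k.sum() * DT) < 1e-9   # zero-mass constraint on eps_r
    return k

def reward_signal(spikes_k, T, delay=100):
    """d(t): one delayed kernel copy per spike of the reinforced neuron k."""
    eps_r = reward_kernel()
    d = np.zeros(T + delay + len(eps_r))
    for t in spikes_k:
        d[t + delay:t + delay + len(eps_r)] += eps_r
    return d[:T]
```

A single spike of neuron k then produces no reward before the delay dr has elapsed and a positive pulse shortly afterwards, followed by the compensating negative tail.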
In order to reinforce high firing rates of the reinforced neuron, we have chosen a reward kernel with a positive bump in the first few hundred milliseconds and a long negative tail afterwards. Figure 3C shows the functions fc and εr that were used in our computer model, as well as the product of these two functions. One sees that the integral over the product is positive, and according to Equation 10 the synapses to the reinforced neuron are subject to STDP. This does not guarantee an increase of the firing rate of the reinforced neuron. Instead, the changes of neuronal firing will depend on the statistics of the inputs. In particular, the weights of synapses to neuron k will not increase if that neuron does not fire spontaneously. For uncorrelated Poisson input spike trains of equal rate, the firing rate of a neuron trained by STDP stabilizes at some value which depends on the input rate (see [24],[25]). However, in comparison to the low spontaneous firing rates observed in the biofeedback experiment [17], the stable firing rate under STDP can be much higher, allowing for a significant rate increase. It was shown in [17] that low firing rates of a single neuron can also be reinforced. In order to model this, we have chosen a reward kernel with a negative bump in the first few hundred milliseconds and a long positive tail afterwards, i.e., we inverted the kernel used above to obtain a negative integral ∫ fc(r) εr(r) dr. According to Equation 10 this leads to anti-STDP where not only inputs to the reinforced neuron which have low correlations with the output are depressed (because of the negative integral of the learning window), but also those which are causally correlated with the output. This leads to a quick firing rate decrease at the reinforced neuron.
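The sign argument above can be checked numerically. The exact form of fc is given in the paper's Methods; here a generic positive, decaying function stands in for it, which is purely an assumption for illustration. With that stand-in, the bump-then-tail kernel yields a positive integral (STDP regime) and the inverted kernel a negative one (anti-STDP regime).

```python
import numpy as np

r = np.arange(0, 3300)                  # time lag r in ms
f_c = np.exp(-r / 400.0)                # hypothetical positive, decaying f_c
eps_up = np.where(r < 300, 1.0, -0.1)   # positive bump + long negative tail (zero mass)
eps_down = -eps_up                      # inverted kernel, used to reinforce low rates

# dr = 1 ms, so a plain sum approximates the integral of f_c(r) * eps_r(r)
integral_up = float(np.sum(f_c * eps_up))
integral_down = float(np.sum(f_c * eps_down))
assert integral_up > 0 and integral_down < 0   # STDP vs. anti-STDP regime
```

Because fc decays with the lag r, the early bump dominates the integral, so flipping the kernel's sign flips the learning regime, which matches the qualitative behavior described above.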