Shifting responsibly: the importance of striatal modularity to reinforcement learning in uncertain environments.

Amemori K, Gibb LG, Graybiel AM - Front Hum Neurosci (2011)

Bottom Line: We then constructed a network model of basal ganglia circuitry that includes these modules and the direct and indirect pathways. Based on simple assumptions, this model suggests that while the direct pathway may promote actions based on striatal action values, the indirect pathway may act as a gating network that facilitates or suppresses behavioral modules on the basis of striatal responsibility signals. Our modeling functionally unites the modular compartmental organization of the striatum with the direct-indirect pathway divisions of the basal ganglia, a step that we suggest will have important clinical implications.


Affiliation: McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA.

ABSTRACT
We propose here that the modular organization of the striatum reflects a context-sensitive modular learning architecture in which clustered striosome-matrisome domains participate in modular reinforcement learning (RL). Based on anatomical and physiological evidence, it has been suggested that the modular organization of the striatum could represent a learning architecture. There is not, however, a coherent view of how such a learning architecture could relate to the organization of striatal outputs into the direct and indirect pathways of the basal ganglia, nor a clear formulation of how such a modular architecture relates to the RL functions attributed to the striatum. Here, we hypothesize that striosome-matrisome modules not only learn to bias behavior toward specific actions, as in standard RL, but also learn to assess their own relevance to the environmental context and modulate their own learning and activity on this basis. We further hypothesize that the contextual relevance or "responsibility" of modules is determined by errors in predictions of environmental features and that such responsibility is assigned by striosomes and conveyed to matrisomes via local circuit interneurons. To examine these hypotheses and to identify the general requirements for realizing this architecture in the nervous system, we developed a simple modular RL model. We then constructed a network model of basal ganglia circuitry that includes these modules and the direct and indirect pathways. Based on simple assumptions, this model suggests that while the direct pathway may promote actions based on striatal action values, the indirect pathway may act as a gating network that facilitates or suppresses behavioral modules on the basis of striatal responsibility signals. Our modeling functionally unites the modular compartmental organization of the striatum with the direct-indirect pathway divisions of the basal ganglia, a step that we suggest will have important clinical implications.
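The modular learning scheme described in the abstract can be made concrete in a few lines. Below is a minimal, illustrative Python sketch of responsibility-weighted modular RL: each module holds its own action values and a prediction of an environmental feature, responsibilities are computed as a softmax over negative squared prediction errors (an assumed Gaussian-likelihood form common in modular RL), and each module's responsibility gates both its learning and its contribution to behavior. The class, function, and parameter names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

class Module:
    """One striosome-matrisome-like module: action values plus a feature predictor."""
    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.2):
        self.q = np.zeros((n_states, n_actions))  # per-module action values
        self.pred = 0.0                            # predicted environmental feature
        self.alpha = alpha                         # action-value learning rate
        self.beta = beta                           # prediction learning rate

    def prediction_error(self, feature):
        return feature - self.pred

def responsibilities(modules, feature, sigma=1.0):
    """Softmax of negative squared prediction errors (assumed Gaussian likelihood)."""
    err = np.array([m.prediction_error(feature) for m in modules])
    logp = -0.5 * (err / sigma) ** 2
    w = np.exp(logp - logp.max())
    return w / w.sum()

def step_update(modules, state, action, reward, next_state, feature, gamma=0.9):
    """Responsibility-gated TD update of each module's action values and predictor."""
    lam = responsibilities(modules, feature)
    for m, lam_i in zip(modules, lam):
        td = reward + gamma * m.q[next_state].max() - m.q[state, action]
        m.q[state, action] += lam_i * m.alpha * td               # value learning gated by lambda
        m.pred += lam_i * m.beta * m.prediction_error(feature)   # predictor gated by lambda
    return lam

def select_action(modules, state, lam, temperature=1.0):
    """Responsibility-weighted action preferences, sampled with a softmax policy."""
    pref = sum(l * m.q[state] for l, m in zip(lam, modules))
    p = np.exp((pref - pref.max()) / temperature)
    p /= p.sum()
    return np.random.choice(len(p), p=p)
```

In this sketch the module with the smallest prediction error receives the largest responsibility, so it dominates learning and action selection whenever its prediction model matches the current environment, while poorly matching modules are effectively shielded from updating.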


Figure 4: Module responsibility, module selection, and preferred location follow changes in environment. (A) Responsibility signals of module B (red) and module A (blue) as functions of time. (B) Difference of responsibility signals, λB − λA, (green line) plotted with the changing environment (Env. A or B; red). Positive differences imply greater module B responsibility, whereas negative differences imply greater module A responsibility. (C) Selected module (blue) and environment (red) as functions of time. In environment A (Env. A), reward is located at s = 1. In environment B (Env. B), reward is located at s = 14. Modules switch rapidly at first and then follow changes in environment. (D) Location of the agent as a function of time late in training, from time 27350 to time 27650 (green line). Blue line indicates location smoothed with a moving average with window of width 100. Red circles indicate the times and locations at which the agent obtained the reward. Environment changes from Env. B to Env. A at time 27500 (dashed line). The module switches from B to A around 27530. (E) Location of the agent as a function of time for the entire training period. Symbols are as in (D). After learning, the agent can obtain rewards in either terminal, depending on the environment. (F) Failure of learning of normal, non-modular RL. In this case, the model learns to obtain rewards only at s = 14.

Mentions: Figure 3A shows the action-value function of each module. Module A learns to choose rightward movements, and module B learns to choose leftward movements. Figure 3B illustrates the prediction of each module, which is updated over time. The learning of predictions is faster than that of action values. Because the selected module is switched if the prediction model produces an error, different prediction models tend to update their values in different environments, and hence, the prediction models tend to produce different predictions. Figure 4A shows the responsibility of each module as a function of time; as a result of learning, the responsibility of each module changes depending on the environment. As we can see in Figure 4B, the temporal dynamics of the difference of responsibilities, λB − λA, follows the location of the reward (which changes depending on the environment). Correspondingly, the model learns to select one of the modules (Figure 4C) and move in the direction of the rewarded terminal (Figure 4D) consistently in each environment.
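As a toy, standalone illustration of the responsibility difference λB − λA tracked in Figure 4B (not the paper's actual simulation), suppose the two prediction models have already specialized to the two reward locations, with module A predicting s = 1 and module B predicting s = 14; the Gaussian width and softmax form below are assumptions:

```python
import numpy as np

# Assumed learned predictions: module A expects the Env. A reward location (s = 1),
# module B expects the Env. B reward location (s = 14). Sigma is illustrative.
pred_A, pred_B, sigma = 1.0, 14.0, 2.0

def lambda_diff(observed):
    """Responsibility difference lambda_B - lambda_A for one observed reward location."""
    logp = -0.5 * ((observed - np.array([pred_A, pred_B])) / sigma) ** 2
    w = np.exp(logp - logp.max())
    lam = w / w.sum()
    return lam[1] - lam[0]

print(lambda_diff(1.0))   # Env. A observation: close to -1, module A is responsible
print(lambda_diff(14.0))  # Env. B observation: close to +1, module B is responsible
```

Once the predictors have separated in this way, the sign of λB − λA flips as soon as the observed reward location changes, which is qualitatively the switching behavior shown in Figures 4B,C.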

