The Computational Development of Reinforcement Learning during Adolescence.

Palminteri S, Kilford EJ, Coricelli G, Blakemore SJ - PLoS Comput. Biol. (2016)

Bottom Line: Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.


Affiliation: Institute of Cognitive Neuroscience, University College London, London, United Kingdom.

ABSTRACT
Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.



pcbi.1004953.g002: Computational models and ex-ante model simulations. (A) The schematic illustrates the computational architecture of the model space. For each context (or state, ‘s’), the agent tracks option values (Q(s,:)), which are used to decide amongst alternative courses of action. In all contexts, the agent is informed about the outcome corresponding to the chosen option (R(c)), which is used to update the chosen option value (Q(s,c)) via a prediction error (δ(c)). This computational module (“factual module”) requires a learning rate (α1). In the presence of complete feedback, the agent is also informed about the outcome of the unchosen option (R(u)), which is used to update the unchosen option value (Q(s,u)) via a prediction error (δ(u)). This computational module (“counterfactual module”) requires a specific learning rate (α2). In addition to tracking option value, the agent also tracks the value of the context (V(s)), which is also updated via a prediction error (δ(v)), integrating over all available feedback information (R(c) and, where applicable, R(u)). This computational module (“contextual module”) requires a specific learning rate (α3). The full model (Model 3) can be reduced to Model 2 by suppressing the contextual module (i.e. assuming α3 = 0). Model 2 can be reduced to Model 1 (basic Q-learning) by suppressing the counterfactual module (i.e. assuming α2 = α3 = 0). (B) Bars represent the model estimates of option values (top row) and decision values (bottom row), plotted as a function of the computational models and task contexts. G75 and G25: options associated with 75% and 25% chance of gaining a point, respectively; L75 and L25: options associated with 75% and 25% chance of losing a point, respectively. “Decision value” represents the difference in value between the correct and incorrect options (G75 minus G25 in Reward contexts; L25 minus L75 in Punishment contexts). Fig 2A was adapted by the authors from a figure originally published in [15], licensed under CC BY 4.0.
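
To make the caption's update rules concrete, below is a minimal Python sketch of the nested model space (Models 1-3). The class name, the softmax choice rule with an inverse temperature β, and the exact way the context value V(s) enters the option updates (here, by centring outcomes on V(s), i.e. relative-value coding) are illustrative assumptions based on the caption and on the description in [15]; they are not necessarily the authors' exact equations.

```python
import numpy as np

# Minimal sketch of the model space in Fig 2A. Parameter names (alpha1,
# alpha2, alpha3) follow the caption; the softmax temperature `beta` and
# the relative-value form of the option updates are illustrative
# assumptions, not the authors' exact equations.

class ContextualQLearner:
    def __init__(self, n_contexts, n_options=2,
                 alpha1=0.3, alpha2=0.0, alpha3=0.0, beta=3.0):
        # alpha2 = alpha3 = 0  -> Model 1 (basic Q-learning)
        # alpha3 = 0           -> Model 2 (adds the counterfactual module)
        # all three alphas free -> Model 3 (adds the contextual module)
        self.Q = np.zeros((n_contexts, n_options))  # option values Q(s, :)
        self.V = np.zeros(n_contexts)               # context values V(s)
        self.alpha1, self.alpha2, self.alpha3 = alpha1, alpha2, alpha3
        self.beta = beta

    def choose(self, s):
        # Softmax decision rule over the option values of context s
        p = np.exp(self.beta * self.Q[s])
        p /= p.sum()
        return np.random.choice(len(p), p=p)

    def update(self, s, c, r_c, r_u=None):
        # Contextual module: V(s) integrates all feedback shown on this trial
        shown = [r_c] if r_u is None else [r_c, r_u]
        delta_v = np.mean(shown) - self.V[s]
        self.V[s] += self.alpha3 * delta_v

        # With alpha3 = 0, V(s) stays at 0 and the updates below reduce to
        # plain Q-learning, so Models 1 and 2 are exact special cases.
        # Factual module: update the chosen option
        delta_c = (r_c - self.V[s]) - self.Q[s, c]
        self.Q[s, c] += self.alpha1 * delta_c

        # Counterfactual module: update the unchosen option
        # (complete feedback only; assumes two options per context, as in the task)
        if r_u is not None:
            u = 1 - c
            delta_u = (r_u - self.V[s]) - self.Q[s, u]
            self.Q[s, u] += self.alpha2 * delta_u
```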

Mentions: More precisely, we hypothesise that, while the basic RL algorithm successfully encapsulates value-based decision-making in adolescence, adults integrate more sophisticated computations, such as counterfactual learning and value contextualisation. To test this hypothesis, adults and adolescents performed an instrumental probabilistic learning task in which they had to learn which stimuli had the greatest likelihood of resulting in an advantageous outcome through trial and error. Both outcome valence (Reward vs. Punishment) and feedback type (Partial vs. Complete) were manipulated using a within-subjects factorial design (Fig 1A). This allowed us to investigate both punishment avoidance learning and counterfactual learning within the same paradigm. In a previous study, model comparison showed that adult behaviour in this task is best explained by a computational model in which basic RL is augmented by a counterfactual learning module (to account for learning from outcomes of unchosen options) and a value contextualisation module (to account for learning efficiently to avoid punishments) (Fig 2A)[15]. Our computational and behavioural results are consistent with our hypothesis and show that adolescents utilise a different, simpler computational strategy to perform the task.
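
As a rough illustration of how such a nested model comparison could be carried out, the sketch below fits each model to a single participant's choices by maximum likelihood and scores it with BIC. It assumes the ContextualQLearner sketch given above and a hypothetical trial format of (context, choice, chosen outcome, unchosen outcome or None); the optimiser, starting values and the BIC criterion are illustrative choices, not the authors' actual fitting and comparison procedure (see Methods and [15]).

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative per-participant model comparison. Trials are assumed to be
# tuples (context, choice, outcome_chosen, outcome_unchosen_or_None).

def neg_log_likelihood(params, trials, n_contexts, model):
    alpha1, alpha2, alpha3, beta = params
    if model < 3:
        alpha3 = 0.0          # drop the contextual module
    if model < 2:
        alpha2 = 0.0          # drop the counterfactual module
    agent = ContextualQLearner(n_contexts, alpha1=alpha1, alpha2=alpha2,
                               alpha3=alpha3, beta=beta)
    nll = 0.0
    for s, c, r_c, r_u in trials:
        logits = beta * agent.Q[s]
        log_p = logits - np.logaddexp.reduce(logits)  # log-softmax
        nll -= log_p[c]
        agent.update(s, c, r_c, r_u)
    return nll

def fit_and_score(trials, n_contexts, model):
    k = {1: 2, 2: 3, 3: 4}[model]  # free parameters per model
    fit = minimize(neg_log_likelihood, x0=[0.5, 0.5, 0.5, 3.0],
                   args=(trials, n_contexts, model),
                   bounds=[(0, 1), (0, 1), (0, 1), (0, 20)],
                   method="L-BFGS-B")
    return 2.0 * fit.fun + k * np.log(len(trials))  # BIC (lower is better)
```

Lower BIC indicates a better trade-off between fit and complexity, so comparing the three scores per participant gives a nested comparison in the spirit of the one reported here.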

