THE REINFORCED BRAIN

Daniel Lindh, Amsterdam Brain and Cognitive Sciences, University of Amsterdam

INTRODUCTION

One of the most essential attributes enabling the successful survival of an organism is the propensity to pursue rewarding states while simultaneously avoiding harmful situations. The term "reward" describes the positive value that an individual ascribes to an object, behavioral act, or internal physical state. The evolutionary purpose of reward representation in the brain seems clear: to reinforce advantageous behavior. Being able to classify and internally represent, over a long period of time, which actions or situations are beneficial is crucial for evolutionary success; this ability is, in essence, learning. Learning in its most primitive form can thus be seen as the ability to use knowledge of past experiences, together with current events, to predict future states. There is a vast corpus of reward literature investigating the role of different neuromodulators in reward processing, and since the 1980s the most investigated neurotransmitter has been dopamine. The dopaminergic reward system can be a fragile structure, where erroneously learned contingencies can lead to severe substance abuse, as in cocaine (Carlezon Jr. et al., 1998; Sora et al., 1998) or morphine (Hnasko, Sotak, & Palmiter, 2005) addiction. Other medical disorders associated with reward processing dysfunction, such as schizophrenia (Nestor et al., 2014; Simon et al., 2015), major depressive disorder (Ubl et al., 2015) and gambling addiction (Dymond et al., 2014), also highlight the importance of a properly functioning reward system. In order to form a comprehensive understanding of how reward is implemented in the brain, several levels of explanation must be utilized. Firstly, we need to understand which brain structures are involved, how they interact and what kinds of processes they take part in. Secondly, we need to consider the role attention plays in both high- and low-order levels of reward processing. Finally, we must consider models of reinforcement in order to assess mechanistic predictions. Because disentangling reward from other concomitant processes is extremely difficult in both animal and human research, in the current review I focus on reinforcement learning in sensory processing. Sensory processing, apart from being more straightforward than higher cognition, acts as a good model of how the brain implements reward learning in all types of cognition. Here, I will discuss perceptual learning, reward-modulated sensory processing, models, and the role of attention.

 

THE REWARDED BRAIN

A stimulus in itself does not intrinsically contain reward value; organisms assign different values to stimuli based on their current internal states and as a function of their previous experiences. Accordingly, reward is implemented in the brain via various neuronal reward signals, existing only as information encoded within and between neurons. In single-neuron recordings, one
of the most influential reward-related findings has come from dopamine neurons in the substantia nigra and ventral tegmental area (Fiorillo, Tobler, & Schultz, 2003; Schultz, Dayan, & Montague, 1997; Schultz, 1986). These findings have later been interpreted as reflecting the prediction error in reinforcement learning, computed as the discrepancy between expectation and outcome. Signs of this type of computation have also been found in several diverse structures, such as the striatum (McClure, Berns, & Montague, 2003), anterior cingulate cortex (Hayden, Heilbronner, Pearson, & Platt, 2011) and frontal cortex (Ramnani, Elliott, Athwal, & Passingham, 2004). Multiple lines of evidence support the idea that these neurons construct and distribute information about rewarding events (Glimcher, 2010; Koob, 1992; Lak et al., 2014). More specifically, they convey signed valence, meaning they code for the motivational, or reward-related, value of upcoming events. Furthermore, the highly interconnected structure of electrical synapses between these dopamine neurons (Vandecasteele, 2005) prohibits individual neurons from firing alone (Komendantov & Canavier, 2002). This is pertinent, because it means that a high degree of neural collaboration is needed to reach a threshold high enough for an actual output from the midbrain. It also means that recordings of only a few dopamine neurons give a good estimate of what the rest of the neural population is doing. Schultz and colleagues were the first to investigate the role of dopamine neurons in motivational processing. In one of their pioneering studies, Schultz, Apicella, and Ljungberg (1993) trained monkeys to execute three different tasks while extracellular recordings were made from single dopamine neurons in the left substantia nigra. In the first experiment, a spatial choice task, the monkeys were trained to press a lever indicated by a light. A liquid reward was administered 500 ms after the correct lever touch, making it possible to differentiate between the motor movement and the reward encoding. In the second experiment a cue to initiate movement was presented 1 second after the onset of the instruction cue, which now served only as a preparatory signal. The final experiment, a delayed response task, was set up similarly to the instructed spatial task, but the initiation cue was presented 2.5-3.5 seconds after the instruction cue, forcing the monkey to keep a representation of the target lever in working memory in the interim. The combined results of these experiments showed that dopamine neurons respond to stimuli crucial for performing a behavioral task and for learning. Specifically, early on dopamine neurons responded to the reward, but once the association was learned, the response shifted to the onset of the instructional cue.
There was no sustained activity between the instructional cue and the initiation cue, implying that these neurons do not encode working memory for the prospective action. A series of follow-up studies were then conducted to map out the nature of these signals. In particular, Fiorillo, Tobler, and Schultz (2003) reported that this shift from reward-related to cue-related responding scaled with the certainty of the association between the two. That is, cues with greater uncertainty of reward exhibit a stronger signal at the time of reward, but a weaker signal at the time of the cue, reflecting the uncertainty of the whole event. It therefore seemed abundantly clear that dopamine neurons within these subcortical structures signal some type of reward prediction error (i.e. the discrepancy between expected and actual reward).
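As a minimal illustration of this quantity (my own toy numbers, not data from the cited studies), the prediction error is simply the difference between the obtained and the expected reward, so a fully predicted reward produces no error while an unexpected delivery or omission produces a large positive or negative error:

```python
def reward_prediction_error(expected_value, actual_reward):
    """Canonical reward prediction error: delta = actual outcome - expectation."""
    return actual_reward - expected_value

# A cue that predicts reward with p = 0.5 (maximal uncertainty):
print(reward_prediction_error(0.5, 1.0))   # reward delivered -> +0.5
print(reward_prediction_error(0.5, 0.0))   # reward omitted   -> -0.5

# A fully learned cue (p = 1.0): the reward itself carries no surprise
print(reward_prediction_error(1.0, 1.0))   # -> 0.0
```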

In  fact,  expression  of  reward  prediction  errors  largely  overlaps  with  regions  coding  for  expected   reward   and   motor-­‐areas   corresponding   to   the   recently   chosen   action   (Doherty   et   al.,   2004;   Padoa-­‐Schioppa  &  Assad,  2006;  Palminteri,  Boraud,  Lafargue,  Dubois,  &  Pessiglione,  2009).  This   corroborates   the   notion   that   the   function   of   prediction   errors   is   to   work   as   a   teaching   signal,   improving   future   reward   predictions   and   movement   selections   in   the   relevant   brain   circuits.  

Furthermore, dopamine neurons in the midbrain also seem to code motivational properties. Indeed, Satoh, Nakai, Sato, and Kimura (2003), using invasive recordings in the monkey midbrain, showed that dopamine neurons in the substantia nigra and ventral tegmental area seem to have at least three distinct types of functions. Firstly, dopamine neurons responding to a conditioned stimulus (CS) encode motivational engagement at the start of the trial. More specifically, Satoh and colleagues showed that expected reward correlated highly with reaction time (RT). This implies that the dopamine neurons coded the actual motivation for a future action, assuming that higher motivation leads to shorter RTs. Secondly, perfectly in line with earlier findings (Fiorillo et al., 2003; Schultz et al., 1997), they also showed that dopamine neurons accurately encode the reward prediction error for a positive reinforcer. Thirdly, while the precise coding of reward prediction errors was something the monkey learned over the course of the experiment, the motivational signature in dopamine firing rate was present throughout. A related influential idea is that slow, tonic dopamine reflects overall motivation/satiety (Salamone & Correa, 2012), whereas fast, phasic dopamine signaling supports learning (Satoh et al., 2003). For example, Hamid et al. (2015) measured dopamine release from the nucleus accumbens over several different timescales. They showed that motivational vigor and reward rate co-varied with minute-to-minute dopamine, while at the same time second-by-second dopamine release coded for an estimate of the temporally discounted future reward. These findings suggest that dopamine conveys a single decision variable that signals the value of work. Although most research has focused on the dopaminergic system when it comes to reward-related predictions, it is possible that dopamine shares this mechanism with other neuromodulators. For example, recent findings have shown tonic serotonin in the raphe nuclei to also reflect motivation, whereas phasic serotonin reflects reward anticipation and prediction errors (Li et al., 2016).

Despite the complex nature of the dopaminergic system in the midbrain, it is not in itself sufficient to account for the full range of processes involved in learning. For example, dopamine neurons in the midbrain have a baseline firing rate of around 4-5 Hz, and a firing rate of up to 30 Hz elicited by a positive reward experience (Montague, Dayan, & Sejnowski, 1996). Knowing this, and assuming linear coding for negative and positive reinforcers, it is improbable that these neurons code for the whole reward spectrum. This opens up the exciting idea of different regions being responsible for the coding of negative reinforcers, and one such candidate is the habenula (Hb) (Benarroch, 2015). The habenula is located in the dorsomedial portion of the thalamus, where it forms an essential connection between the forebrain and brainstem monoaminergic nuclei. The Hb comprises two subdivisions, lateral and medial, which differ mainly in their neurochemical characteristics and connectivity. Given that the lateral Hb exerts an inhibitory modulation both on dopaminergic neurons of the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA), it is not hard to see why this nucleus is of interest when considering the regulation of learning and reward. Interestingly, it has been shown that the lateral Hb is a primary source of negative reward-related signals to DA neurons. Matsumoto and Hikosaka (2007) recorded from neurons in the Hb while monkeys performed a visually guided saccade task. Many neurons in the lateral Hb showed a phasic response to no-reward-predicting targets and inhibition for reward-predicting targets, an effect opposite to that found earlier in midbrain dopamine neurons (Fiorillo et al., 2003; Schultz et al., 1997; Schultz, 2015). Furthermore, electrical stimulation of the lateral Hb prompted a strong inhibition of midbrain dopamine neurons through GABAergic connections
mediated  through  the  rostromedial  tegmental  nucleus,  providing  a  plausible  mechanism  for  the   findings  of  suppressed  activity  in  VTA  and  substantia  nigra  for  the  absence  of  predicted  reward   (Schultz,  1986).  

Studies recording single cells have also found neurons that seem to code for action-values (Lau & Glimcher, 2008; Samejima, Ueda, Doya, & Kimura, 2005; Tai, Lee, Benavidez, Bonci, & Wilbrecht, 2012). Action-values are an important concept in reinforcement learning, referring to the assignment of probable future values to a variety of possible actions, which are later used to decide on the most favorable option in a given situation (i.e. Q-values, see Sutton & Barto, 1998). For example, Samejima et al. (2005) trained monkeys to turn a lever to either the left or the right. By manipulating the probability of high reward for left versus right choices, the authors could show that certain neurons in the striatum coded both a preferred direction and the action-value for that specific direction. It has also been shown that the values of predictive visual cues, chosen with either the left or right hand, are represented in the contralateral ventral prefrontal cortex (Palminteri et al., 2009). Considering the striatum's known role in motor-action control (Cui et al., 2014), it is not surprising that one attractive notion is that action-values are predominantly represented in motor-related areas, such as motor cortex, supplementary motor cortex and the supplementary eye fields (for saccades) (Hunt, Woolrich, Rushworth, & Behrens, 2013; Wunderlich, Rangel, & O'Doherty, 2009). This is intuitive, since these areas also plan the actions to be made, allowing more efficient processing to be attained through learning. Nevertheless, it could be the case that these values are represented more ubiquitously than most of these studies report. FitzGerald, Friston, and Dolan (2012) showed action-specific signals in ventromedial PFC, putamen, insula, thalamus and hippocampus using multivariate Bayes (MVB) (Friston et al., 2008). Similarly, Vickery, Chun, and Lee (2011) could decode the feedback (win/loss) in all 43 regions of interest across the whole cortex using multivoxel pattern analysis (MVPA) (Hanke et al., 2009). A similarity between these two approaches is that they can extract information from dispersed neuronal populations that are specific to certain processes, and do not require focal, spatially coherent activations. Meanwhile, conventional univariate fMRI analyses, which assume that a larger blood-oxygen-level-dependent (BOLD) response equals more reward processing, are inherently insensitive to these types of signatures.
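As a concrete illustration of this action-value bookkeeping (a minimal sketch loosely in the spirit of the Samejima et al. lever task; the reward probabilities, learning rate and softmax temperature are invented for the example), the code below maintains a Q-value for a left and a right action, chooses between them with a softmax rule, and updates the chosen value with a prediction error:

```python
import math
import random

def softmax_choice(q_values, beta=3.0):
    """Pick an action with probability proportional to exp(beta * Q)."""
    weights = [math.exp(beta * q) for q in q_values.values()]
    total = sum(weights)
    r, cumulative = random.random() * total, 0.0
    for action, w in zip(q_values, weights):
        cumulative += w
        if r <= cumulative:
            return action
    return action

# Hypothetical task: 'left' pays off on 80% of trials, 'right' on 20%.
reward_prob = {"left": 0.8, "right": 0.2}
q = {"left": 0.0, "right": 0.0}   # action-values (Q-values)
alpha = 0.1                        # learning rate

for trial in range(500):
    action = softmax_choice(q)
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    q[action] += alpha * (reward - q[action])   # prediction-error update

print(q)  # the Q-values end up roughly near the true reward probabilities
```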

For accurate updating of action-values an agent also needs dedicated structures that implement credit assignment. One candidate area for such a computation is the lateral orbitofrontal cortex (OFC). During the interval between decision and reward, lateral OFC neurons are relatively quiet (compared to dorsolateral PFC neurons). During feedback, however, lateral OFC neurons become relatively more active, reflecting the current choice responsible for the outcome (Tsujimoto, Genovesio, & Wise, 2009). Tsujimoto and colleagues propose a pivotal role for the lateral OFC in reactivating relevant choice representations, assisting Hebbian learning. In addition to reactivating choice representations, OFC neurons are also able to preserve neural representations of rewards over an extended period of time, despite the presentation of distracting reward outcomes (Lara, Kennerley, & Wallis, 2009). This putative function of the lateral OFC is further substantiated by research showing that lesions to the lateral OFC impaired monkeys' ability to make value-related decisions between objects and to update action-values based on current feedback (Rudebeck & Murray, 2011). In contrast, the removal of ventromedial PFC (vmPFC) showed no such impairment (Rudebeck & Murray, 2011). Instead, BOLD activation in
vmPFC is known to correlate with the predictive value of future outcomes (Kable & Glimcher, 2007; Plassmann, O'Doherty, & Rangel, 2007; Tom, Fox, Poldrack, & Trepel, 2007), as well as with the subjective value at the time of reward (Sescousse, Redoute, & Dreher, 2010). It has been proposed that this signal reflects a comparison between different possible options in the value domain (Boorman, Behrens, Woolrich, & Rushworth, 2009; FitzGerald, Seymour, & Dolan, 2009). Contrary to vmPFC, anterior PFC appears to encode the value of choices that were not selected (Boorman et al., 2009; Rushworth, Noonan, Boorman, Walton, & Behrens, 2011). Specifically, when feedback is presented about the value of the alternative choice participants could have made, activation in anterior PFC reflects the prediction error of this counterfactual choice (Boorman, Behrens, & Rushworth, 2011). Additionally, this counterfactual prediction error has recently been observed in the human striatum (Kishida et al., 2015). Kishida and colleagues estimated sub-second dopamine fluctuations using fast-scan cyclic voltammetry (Kishida et al., 2011) in the striatum of Parkinson patients while the patients performed a sequential investment game. Dopamine fluctuations did not simply reflect the reward prediction error, but rather a combination of the reward prediction error and the counterfactual reward. In fact, earlier studies have revealed that humans use both counterfactual information (feedback relating to choices that were not made) and reward prediction errors over choices that were actually made to inform their upcoming decisions (Chiu, Lohrenz, & Montague, 2008; Lohrenz, McCabe, Camerer, & Montague, 2007). To contrast the functions of vmPFC and anterior PFC even further, Daw, O'Doherty, Dayan, Seymour, and Dolan (2006) reported that exploitative choice of high-value options was associated with vmPFC, whereas anterior PFC seemed to process lower values during exploration. Reconciling this finding with earlier findings on anterior PFC function (Boorman et al., 2009, 2011) suggests that the anterior PFC signal during exploration (Daw et al., 2006) reflects either a high probability of switching to another alternative, or the high value of the discarded options while exploring.

In the literature, there seems to be a lot of overlap across a variety of reward processes. Both ends of the spectrum of signed valence are coded by similar neurochemicals, within similar structures. However, processes like the representation of action-values and the assignment of action-values differ in their temporal engagement, as well as in their probable relevant structures. Nevertheless, it is possible they have more in common than is apparent at first sight. As pointed out above, many fMRI studies assume that external variables influence neurobiological measurements in a linear manner. Specifically, they assume that the BOLD response adds up with more reward. This has been conjectured based on psychometric-neurometric experiments where, for example, a monkey's self-report of perceived motion direction was predicted by higher activity in motion area MT (Treue & Martínez Trujillo, 1999), or where objective stimulus intensity was related by a power function to both the subjective intensity of the stimulus and the BOLD response (Polonsky, Blake, Braun, & Heeger, 2000). However, this assumption is not necessarily true, and methods like MVPA and MVB are probably more sensitive in picking up the subject-specific signals associated with different reward processes. Another distinction, which might explain incongruences between results found in single-cell studies and fMRI studies, is the nature of what they measure. Single-cell studies usually report the spike rate of cells, whereas fMRI studies report BOLD, which is not believed to reflect spike rates but rather local field potentials (LFPs) (Logothetis, Pauls, Augath, Trinath, & Oeltermann, 2001). While spike rates correlate with neural output, LFPs are associated with subthreshold activity as well as incoming input into the area (Logothetis, Pauls, Augath, Trinath, & Oeltermann, 2001; Logothetis & Wandell, 2004;
Logothetis, 2003). Another caveat is that investigating reward is not straightforward; one main limitation is the temporal and spatial overlap that reward has with attention in the brain (Maunsell, 2004). Stănişor, van der Togt, Pennartz, and Roelfsema (2013) report such a finding, in which monkeys were trained on a curve-tracing task while neurons were recorded in V1. The curve-tracing task allowed the researchers to manipulate attention and reward representation by means of distractors, and a comparison between the two showed that the effects of relative value had a similar timing and magnitude as the effects of selective attention. The authors argue that their findings support the view that studies that examine attentional processes on the one hand, and reward on the other, actually investigate the same selection processes. Researchers usually train monkeys in attention paradigms by using reward. For example, the monkeys might get rewarded for one (attended) stimulus, but not for the other (unattended) stimulus (Stănişor et al., 2013), meaning that the original aim of investigating attention is now contaminated by reward processing as well.

 

THE PERCEPTUAL BRAIN

Considering the vast, fluctuating landscape of information surrounding us all of the time, the brain's ability to predict and quickly structure incoming information is an extraordinary feat. The classical view of the perceptual system as almost purely driven by bottom-up processes has been heavily challenged in recent years. In addition to bottom-up input, the visual cortex also receives large amounts of feedback from higher-order cortical areas (Harris & Mrsic-Flogel, 2013; Muckli & Petro, 2013). Thus, a notion that has gained more traction in recent decades is predictive coding (Hohwy, 2014; Lee & Mumford, 2003; Rao & Ballard, 1999). Predictive coding states that top-level areas continuously send predictions to early sensory processing areas in a hierarchical manner, which has been shown to speed up the processing of incoming stimuli (O'Brien & Raymond, 2012). For example, in the visual cortex, prior predictions evoke a preparatory neural template of the expected incoming stimulus, with a BOLD response that closely resembles the BOLD response evoked by the stimulus itself (Kok, Failing, & de Lange, 2014a). A clear example of predictions affecting our perception can be found in binocular rivalry, a perceptual phenomenon described as far back as 1593 by Giambattista Della Porta (Hohwy, 2014). In binocular rivalry, each eye receives different visual input and, instead of the two images fusing, one eye becomes dominant, resulting in one clear percept. Eye dominance alternates every few seconds, sometimes with periods of patchy transitions. Findings during binocular rivalry strengthen the supposition that the brain is engaged in high-level inferential work. First presenting the same image to both eyes and then changing the input for one of the eyes can prime which eye initially dominates perception: the eye receiving the same image as before is more likely to be dominant than the eye whose input has been switched (Mitchell, Stoner, & Reynolds, 2004). In 1928 Emilio Diaz-Caneja (Hohwy, 2014) cut two images in half and presented a combination of the two halves to each eye. Interestingly, even then the percepts did not fuse; instead, people perceived a complete picture by combining the half from one eye with the corresponding half from the other eye (Hohwy, 2014). This is an impressive achievement by the brain. The sophisticated inferences made by the brain, as shown by binocular rivalry findings, suggest that even high-conceptual
notions reach deep down into our low-level perceptual machinery, affecting our perceptual awareness to a much larger extent than previously believed.

According to a contemporary theoretical framework, the main goal of the brain is to predict future states, and thus minimize surprise, in order to effectively process and interact with the world (Friston, 2009, 2010). Over the past decades, a vast number of findings have demonstrated the predictive nature of the brain, in domains ranging from vision (Kok, Failing, & de Lange, 2014b; Kok, Jehee, & de Lange, 2012) and audition (Cohen, Elger, & Ranganath, 2007) to self-recognition/the embodied self (Apps & Tsakiris, 2014; Seth, 2013) and somatosensory perception (Allen et al., 2015). Furthermore, higher-order functions have been implicated in computing probabilities and predictions, such as action preparatory activity (Bestmann et al., 2008), memory (Kumaran & Maguire, 2009), and cognitive control (Pezzulo, 2012). So what could be the advantage of using prediction errors? The short answer, again, is to aid learning. The main reason the notion of prediction errors is so fascinating is how intuitively and ingeniously it can describe and emulate learning in computational models through error correction (Sutton & Barto, 1998). It is intuitive in the sense that if we constantly update our model of the world based on the size of the error, we will gradually increase the precision of our future predictions. Of course, there are several bottlenecks in the processing of incoming sensory information that prevent us from having a perfect model of the world at all times, but given the limited amount of data we can process, the use of prediction errors gives us a fairly good estimate. However, despite being an attractive notion from both a computational and an empirical viewpoint, some inconsistent results can be used to argue against predictive coding. Rao and Ballard (1999) proposed that an intelligent system would not be surprised by predictable stimuli and that only unexpected input features are passed forward to the next stage of processing. It can be contended that this is problematic, seeing that predictions sometimes seem to boost rather than attenuate sensory processing (Chaumon, Drouet, & Tallon-Baudry, 2008; Doherty, Rao, Mesulam, & Nobre, 2005). However, it has been reasoned that these findings are confounded by attention (Kok, Rahnev, Jehee, Lau, & De Lange, 2012). In fact, in further elaborations of the predictive coding model it has been proposed that attention increases the weights for certain sensory evidence (Friston, 2009; Kok, Rahnev, et al., 2012; Rao, 2005), leading to higher precision of pertinent incoming information.
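To make the precision-weighting idea concrete, the following minimal sketch (my own toy construction, not a model from the cited papers; all parameter values are arbitrary) lets a higher level track a noisy sensory signal by repeatedly correcting its prediction with a precision-weighted prediction error, where the precision term plays the role attributed to attention above:

```python
import random

def predictive_coding_step(sensory_input, prediction, precision, lr=0.1):
    """One toy update: compare the top-down prediction with the bottom-up input
    and pass up only the precision-weighted prediction error."""
    error = precision * (sensory_input - prediction)
    return prediction + lr * error, error

random.seed(0)
true_signal = 1.0

# Attention modeled as a gain ("precision") on the ascending error signal:
for precision, label in [(1.0, "attended"), (0.2, "unattended")]:
    prediction = 0.0
    for _ in range(50):
        sample = true_signal + random.gauss(0, 0.1)  # noisy bottom-up input
        prediction, error = predictive_coding_step(sample, prediction, precision)
    print(label, round(prediction, 2))  # higher precision -> faster, closer tracking
```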

There are, in principle, two types of prediction errors discussed in the literature (Den Ouden, Kok, & de Lange, 2012). I have already discussed the first, the motivational prediction error, expressing the degree of surprise caused by a particular rewarding scenario (Fiorillo et al., 2003; Lak, Stauffer, & Schultz, 2014; Schultz et al., 1997; Schultz, 1986). The second type of prediction error, and theoretically the more recent, relates to perception. In perception, predictive coding is believed to work in a hierarchical manner, in which each level of processing is predicted by the level above it. Here, only the information from deviant, non-predicted, bottom-up sensory evidence is passed along to the next level of analysis (Rao & Ballard, 1999). Consequently, prediction errors are believed to be abundant all over the brain and a highly integrated and essential part of all levels of learning. In fact, it has been proposed that the main goal of the brain is to strive to minimize the amount of surprise by constantly updating its internal model of the world (Friston, 2009, 2010). There is a vast amount of empirical evidence for predictive coding processes in sensory processing areas (Allen et al., 2015; Clark, 2013; Jack & Hacker, 2014; Kok et al., 2014a; Kok, Rahnev, et al., 2012;
Rauss, Schwartz, & Pourtois, 2011; Shipp, Adams, & Friston, 2013). One of the most robust paradigms for inducing prediction errors is the oddball task, in which a divergent stimulus is presented after a sequence of repetitive stimuli. This elicits a large, probability-scaled neural response originating from the sensory areas (Akatsuka, Wasaka, Nakata, Kida, & Kakigi, 2007; Stagg, Hindley, Tales, & Butler, 2004). Perceptual prediction errors can also be distinguished from associated concepts like adaptation and attention. This can be shown in omission paradigms, where a predicted stimulus is withheld yet still yields a large neural response in the relevant sensory areas (Den Ouden, Friston, Daw, McIntosh, & Stephan, 2009; Todorovic, Ede, Maris, & Lange, 2011). Another way of investigating perceptual prediction errors is by using illusions. For example, using Kanizsa illusions (where Pac-Man-shaped inducers are aligned such that their edges trigger the percept of an illusory contour), Kok and De Lange (2014) showed that illusory perception of a shape comes with an elevated BOLD response in visual regions where bottom-up sensory evidence is absent but part of the shape is expected (reflecting a prediction error). At the same time, they found that regions that do receive bottom-up evidence that is predicted by the perceptual shape showed an attenuated BOLD response, consistent with how prediction errors are assumed to propagate in the visual hierarchy (Rao & Ballard, 1999). In a follow-up study using ultra-high-field fMRI (7T), Kok et al. (2016) were able to separate the cortex into three parts: deep, middle and superficial layers. Since feedback and feedforward connections are largely segregated in the visual cortex (Rockland & Pandya, 1976), with feedback mostly targeting the deep and superficial layers and feedforward connections targeting the middle layers, Kok and colleagues predicted a layer-specific response depending on top-down or bottom-up influence. Indeed, the authors found that bottom-up stimuli activated all three layers almost equally, whereas top-down signals ("predictions") showed higher activation in the deep layer. Again, this clearly follows earlier findings that expected stimuli are attenuated while unexpected bottom-up signals are enhanced. Yet another illusion that shows the extraordinary explanatory power of predictive coding is the McGurk effect (McGurk & Macdonald, 1976). The McGurk effect is a multisensory perceptual phenomenon that displays a collaborative interaction between auditory and visual areas in the processing of speech. For example, when an auditory stimulus such as "Ba" is paired with the visual input of someone saying "Ga", subjects report that they perceive a syllable in between (like "Da").
In   line   with   predictive   coding   accounts,   the   more   predictive   a   visual   stimulus   is   of   the   subsequent   spoken   syllable,   the   stronger   the   response   in   superior   temporal   sulcus   when   this   prediction   is   violated   (Arnal   et   al.,   2009).   Both   of   these   illusory   accounts   demonstrate   the   hallmark   of   a   prediction   error,   as   described   by   Rao   and   Ballard   (1999).    

As briefly discussed earlier, an additional vital aspect of learning and reward is attention. Attention is believed to play a pivotal role in prediction errors, increasing their weight (Friston, 2009; Kok, Rahnev, et al., 2012), which in turn putatively increases the processing speed of sensory evidence and directs learning. Reward in itself seems able to modulate attention, such that it increases performance in visual tasks as a function of incentive value (Engelmann, Damaraju, Padmala, & Pessoa, 2009). In addition to improving performance, monetary reward also concomitantly boosts the BOLD response in task-related perceptual and cognitive regions, together with reward-related regions (Engelmann et al., 2009; Pochon et al., 2002; Small, 2005). Moreover, dopamine-related areas like the striatum, in addition to reward prediction (Schultz, 1986), have also been implicated in coding for incremental
attention-capturing saliency (Zink, Pagnoni, Chappelow, Martin-Skurski, & Berns, 2006). Together with its role in action initiation (Cui et al., 2014; Shiflett & Balleine, 2011), this finding suggests a facilitating role for subcortical dopamine in reallocating attentional resources. Accordingly, Pessoa (2009) proposed that an enhanced interaction between subcortical reward-related areas and perceptual and cognitive regions reallocates attention and improves performance, thereby promoting successful reward-seeking behavior. This is interesting because these findings suggest that the allocation of attentional resources in the midbrain provides a link between the motivational dopaminergic prediction errors and the modality-specific predictions in the cortex. However, because of the similar nature of attention and reward (Stănişor et al., 2013), attention also poses a methodological problem for researchers. The main caveat of reward research is the difficulty of disentangling neural reward and attentional signals, causing many reward studies to be confounded by attention (Maunsell, 2004). Thus, one vital question becomes whether or not learning can occur without attention. Watanabe, Náñez, and Sasaki (2001) trained their participants on a letter task while simultaneously presenting moving dots at a subthreshold coherence level. They showed that motion direction discrimination was later improved selectively for the direction that had been presented during the letter task. It was later shown that this task-irrelevant visual perceptual learning was contingent on the task-irrelevant feature being presented below threshold (Tsushima, Seitz, & Watanabe, 2008). Watanabe and colleagues argued that if the task-irrelevant feature were presented above the threshold of detection, it would be treated as a distractor and attention would attenuate the irrelevant feature, prohibiting any task-irrelevant learning (Sasaki, Nanez, & Watanabe, 2010; Seitz, Kim, & Watanabe, 2009; Seitz & Watanabe, 2005; Watanabe & Sasaki, 2015). Persichetti, Aguirre, and Thompson-Schill (2015) reported a slightly different finding: they first taught participants to associate novel shapes with different monetary rewards, and the subjects later completed an unrelated, but demanding, perceptual task using the same shapes. Curiously, Persichetti and colleagues showed that shapes earlier associated with high reward evoked an increased BOLD response in visual cortex despite the fact that attention was drawn away from the associated value of each shape.

Predictions and prediction errors are becoming increasingly popular as an explanatory framework for a wide range of neuropathological diseases, perceptual experiences, and higher cognitive functions. The importance of predictions becomes clear when considering what happens when things go awry. For example, erroneous prediction errors have been proposed to underlie adolescent risk-taking (Cohen et al., 2010) and high-level dysfunctions (Simon et al., 2015; van Boxtel & Lu, 2013). One such dysfunction is psychosis (Corlett, Honey, & Fletcher, 2007; Corlett et al., 2007; Corlett & Fletcher, 2015; Yamashita & Tani, 2012), in which individuals stereotypically report disrupted perceptual experiences, such as brighter colors and louder sounds, and in which the assignment of inappropriate significance to these experiences leads to delusions. These experiences are all congruent with an inability to explain away incoming stimuli due to erroneous prediction errors. Another, perhaps even more surprising, disorder proposed to be caused by weak prediction errors is autism (Pellicano & Burr, 2012; Sinha et al., 2014; van Boxtel & Lu, 2013). In the sensory systems of people with autism spectrum disorder, weak prediction errors putatively lead to a perpetual shower of new "surprises", increasing the amount of sensory input the brain has to process.


THE MODELLED BRAIN

A hallmark of true understanding of a mechanism is the ability to reproduce the process from the ground up, exhibiting the same properties when exposed to the same situations. Models of the brain can be used to demonstrate understanding of the underlying principles, helping us to further develop tools to investigate other processes of the brain and to predict the outcomes of situations by simulation. One of the pioneers of modeling the visual system was David Marr. He advocated viewing the brain's visual organization as a pure information processing system, proposing that one must understand it at three distinct, complementary levels of analysis (known as Marr's Tri-Level Hypothesis) (McClamrock, 1991). The first level is the computational level: what is the function of the system? What types of problems does it need to solve and overcome? And why is it doing these things? The second level is the algorithmic level: how are these functions represented in the brain, and what kinds of processes are used to manipulate the representations? The third level is the level of implementation: how are these functions realized in the brain? That is, which neural structures and neural activities carry out the algorithms and processes that solve the problems of the system? These levels are not specific to the visual system, but can be applied as a general framework for understanding the whole brain. On top of these levels, Tomaso Poggio proposed the level of learning (Poggio, 2012): the level at which the system learns how to process information in an adequate manner, without needing to be preprogrammed for the specific task. This is exactly the level at which sufficient and necessary models of reinforcement learning should be situated.

Models of reinforcement learning traditionally arise from psychology and computational science, where researchers have tried to understand the brain by either testing behavior or constructing artificial intelligence. More than a hundred years ago, Ivan Pavlov observed in his famous experiments on salivating dogs that when the ringing of a bell is consistently paired with food, the dogs eventually start to salivate as soon as the bell is rung (Rescorla & Solomon, 1967). This is known as classical conditioning, in which an innate response (salivating) to a potent stimulus (food) comes to be prompted by a previously neutral stimulus (the sound of a bell). The first to mathematically formalize this learning process were Bush and Mosteller (1951), who proposed that the probability of the dogs salivating could be expressed as an iterative equation:

A_{next trial} = A_{last trial} + α (R_{current trial} − A_{last trial})

where A_{next trial} is computed by taking the value of A on the last trial and adding the discrepancy between the current (actual) value R and the last (expected) value A, i.e. the prediction error, multiplied by a learning rate α between 0 and 1. When α equals 1, A is always updated so that it equals the R observed on the most recent trial. In fact, for any α greater than 0 and smaller than 1, the value of A will converge to the value of R; the smaller the learning rate, the slower this convergence will be. So what the Bush and Mosteller equation does, in effect, is compute an average of the reward over previous trials in which recent trials carry more weight. The learning rate dictates how quickly this weight decays over past trials, with a higher learning rate rendering the prediction less influenced by older trials. The importance of the Bush and Mosteller equation is non-trivial; it was the first to utilize an iterative error-based rule for reinforcement learning, forming the keystone for most later models. In an extension of the Bush and Mosteller rule, Rescorla and Wagner (Sutton & Barto, 1998) built a learning model
that was used to investigate associative connections when two predictive cues were paired with the same event. Their model has become so prominent that many now fallaciously attribute the Bush and Mosteller equation to Rescorla and Wagner (Glimcher, 2010).
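As a minimal sketch of this iterative rule (my own variable names and a made-up reward sequence, purely for illustration), the snippet below applies the update trial by trial and shows how the estimate A tracks a recency-weighted average of reward, converging faster for larger learning rates and simply copying the last outcome when α = 1:

```python
def bush_mosteller_update(a_last, r_current, alpha):
    """A_next = A_last + alpha * (R_current - A_last), the iterative delta rule."""
    return a_last + alpha * (r_current - a_last)

rewards = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # hypothetical trial outcomes (80% rewarded)

for alpha in (0.1, 0.5, 1.0):
    a = 0.0
    history = []
    for r in rewards:
        a = bush_mosteller_update(a, r, alpha)
        history.append(round(a, 2))
    print(f"alpha={alpha}: {history}")
# Small alpha -> slow, smooth convergence; alpha=1 -> A simply copies the last reward.
```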

With time, two key issues with these models emerged (Sutton & Barto, 1998). First, they all treated time as discrete epochs, where learning happens at the end of each epoch (or trial). In the real world, however, time is continuous, meaning that several events within a trial can carry meaning for the end result. The second issue concerns the linking of sequential cues. For example, the earlier models were good at using the conditioned cue to predict the value of the trial, but they did not incorporate the notion that the later appearance of the reward was itself non-informative. Sutton and Barto (1998) maintained that one of the main problems with the earlier models was that the definition of the problem the models were trying to solve was incorrect, thus violating the first level of Marr's Tri-Level Hypothesis. The goal is not to learn the value of previous events, which is what Sutton and Barto argued the Bush and Mosteller rule actually does; it is to predict future events. In their temporal difference learning model (Sutton & Barto, 1998; Sutton, 1988), the prediction error is computed by taking the difference between the prediction of all future rewards and any information that leads to an alteration of beliefs. This information is not constrained to direct unconditioned reward, but also includes signals that are predictive of upcoming rewards. This is a critical difference from a prediction error computed from past events and only the current reward, as in the Bush and Mosteller class of learning models. Furthermore, learning did not happen only after each epoch; because time was represented as a series of minimally discrete moments, the predictive model was updated whenever salient and relevant events occurred. In addition, at each time step the model carries not only a prediction of reward at that very moment, but also the predicted sum of the discounted reward for all subsequent moments. To illustrate this, imagine a situation where time is divided into discrete time points. At any point in time, a reward has an equally low probability, meaning that a reward at any time will yield a big prediction error response. Now, imagine that a tone starts to occur just before every reward. The first time this happens, the tone carries no information about the subsequent reward, which is at this point still surprising. Gradually, over time, the tone comes to fully predict the reward and the actual reward no longer carries any additional information. The prediction error now starts to occur together with the tone, because of the unpredictability of its timing. Temporal difference learning models achieve this by assigning each obtained reward not just to the value function for the current moment in time but also to previous time increments.
So, models of reinforcement learning (Sutton & Barto, 1998) can be described as following three steps: (1) the organism estimates the value of each action, (2) an action is selected based on a comparison between several action-values, and (3) the action-values are updated, based on the prediction error generated by the current event. Temporal difference learning models have gained traction in neuroscience in large part because of the work of Schultz (1986), who recorded dopamine neurons in the midbrain of monkeys engaged in a reinforcement learning task. The monkey was seated in front of two levers, each lever having one corresponding light cue. After the illumination of the light cue, the monkey received a juice reward if it pressed the correct lever. In the early stages of the task, neurons were silent at the start cue but responded strongly whenever the monkey received a juice reward. As the monkey continued to perform the task, and the learning process evolved, both the behavior and the activity of the neurons changed: the monkeys started pressing the correct levers more often, and the neurons stopped firing at the time of the reward, responding instead to the onset of the light cue.
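To illustrate this within-trial logic, the sketch below implements a minimal tabular TD(0) learner for the cue-reward situation just described (the time points, learning rate and discount factor are arbitrary choices of mine, not parameters from the cited work). Early in training the prediction error appears at the moment of reward; after learning it has migrated to the moment of the cue, mirroring the shift reported by Schultz and colleagues.

```python
# Minimal TD(0) sketch of the cue-reward logic described above (toy parameters):
# a trial has 9 time points, the cue appears at t=2 and the reward at t=6.
# Pre-cue time points are treated as unpredictable background, so their value
# estimates are clamped at zero.
alpha, gamma = 0.3, 0.95
n_points, cue_t, reward_t = 9, 2, 6
values = [0.0] * (n_points + 1)        # V[t]: predicted discounted future reward at t

def run_trial(values):
    """One pass through the trial; returns the TD error felt at each time point."""
    deltas = [0.0]                      # nothing is observed before t = 1
    for t in range(1, n_points):
        reward = 1.0 if t == reward_t else 0.0
        delta = reward + gamma * values[t] - values[t - 1]   # TD prediction error
        if t - 1 >= cue_t:              # only post-cue states learn a prediction
            values[t - 1] += alpha * delta
        deltas.append(round(delta, 2))
    return deltas

first = run_trial(values)
for _ in range(300):
    last = run_trial(values)
print("early trial:", first)   # the error spikes when the reward is delivered (t=6)
print("late trial: ", last)    # after learning it spikes at the cue instead (t=2)
```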
