Future Models for Reward Motivation

Motivation for Well-Being

Sigmund Freud’s early tenet asserted that people are motivated to attain pleasure, avoid pain, and maintain stability and equilibrium (Freud, 1922, pp. 38 & 4). People prefer to be in certain states of pleasure (of positive well-being) and seek to avoid unpleasant, painful states (of negative well-being) (Stein & Levine, 1990). In fact, “much of behavior is carried out in the service of achieving and maintaining goal states… that people prefer to be in … (e.g. those producing pleasure) and … avoid other states (e.g. those producing pain)” (Stein, Trabasso, & Liwag, 1993, p. 281). People therefore set goals for attaining and maintaining valued internal states of pleasure and well-being and for avoiding unpleasant states and perceived pain.

This motivation can be modeled accordingly, where the desire for the well-being ideal (Wbideal) and valued state (V(s)) is manifested in a goal that will later drive behavior to achieve it.

Wbideal = Pleasure

Wbideal = V(s)

Where the sense of well-being (Wbideal) is also the valued state (V(s)) across many different human dimensions, including, but not limited to, internal states such as psychosocial processes, social comparisons, reflected (self-other) appraisals, coping strategies, psychological centrality, etc. (Keyes, Shmotkin, & Ryff, 2002). One's anticipated or future sense of well-being (V(st+1)) is enhanced by the prospect of future reward and of gaining an increased sense of well-being from acquiring a return (γ ∑ rt+1). This is represented accordingly.

Where: Wbideal = V(s)

V(s) = V(st+1) + γ ∑ rt+1

Where: 0 ≤ γ ≤ 1.

The discount factor, γ, reflects a weighting in which immediate rewards are weighted more heavily, and valued more highly, than rewards that are distant and whose outcome is uncertain. Furthermore, not only is a return (reward or reinforcement) able to elicit and enhance positive well-being, but it also has an incentive role as an attractor (or motivator) and serves a positive feedback function in supporting a person's or agent's learning about a task. In the algorithms noted below, the return component (γ rt+1) takes on both roles, providing incentive and anticipated feedback.
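To make the weighting concrete, the following Python sketch computes a discounted return under an assumed γ; the function name, the reward list, and the numeric values are illustrative only and are not drawn from the model above.

def discounted_return(anticipated_rewards, gamma=0.9):
    """Sum of gamma**k * r_{t+1+k}, with 0 <= gamma <= 1 (illustrative assumption)."""
    return sum((gamma ** k) * r for k, r in enumerate(anticipated_rewards))

# The same reward counts for less the further in the future it lies.
rewards = [1.0, 1.0, 1.0]                               # r_{t+1}, r_{t+2}, r_{t+3}
print(round(discounted_return(rewards, gamma=0.9), 2))  # 2.71
print(round(discounted_return(rewards, gamma=0.5), 2))  # 1.75

With γ = 0.9 the distant rewards still contribute substantially; with γ = 0.5 they are sharply devalued, which is the sense in which immediate rewards dominate the return component.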

According to Stein, Trabasso, & Liwag (1993), people attend to, monitor, identify, understand the context and conditions of, evaluate features of, and attribute meanings and relevance to an unfolding event (Ec). These and other cognitive processes underlie the problem-solving and strategy development (π) necessary for later task-related behavioral implementation or action (at+1).

To acquire a sense of well-being, the person must complete or implement some task requirement presented by the environmental condition (Ec). This necessitates the development of a task-related strategy (π).

The task-related strategy or policy, (π), is composed of four functions that emanate from external conditions. The first function involves the features of a stimulus and surrounding contextual stimuli. The primary stimulus and peripheral (contextual) stimuli are characterized by having perceptual (e.g. visual, auditory, tactile, gustatory, and olfactory), aversive (e.g. being painful, alien, etc.), rewarding (e.g. having salience or being novel, habitual, comforting, soothing, etc.), or neutral (e.g. having no effect) feature qualities. Such perceptual, aversive, or rewarding feature qualities are affected by varying levels of temporal intensity, brevity, duration, intermittence, frequency, activity, temporal delay, temporal trace, ability to impute associative strength between stimuli, etc. The manner in which stimuli interact with one another along these dimensions produces synchronous or conflicting patterns that underlie the expression of a second function, called a parameter. The third and fourth functions involve spatial relationships between stimuli (e.g. approach, avoidance, etc.) and the motivational state of the subject or organism (e.g. hunger, thirst, frustration, perceived sense of loss and of well-being, etc.). All these parametric functions comprise the task-related strategy or policy (π).

The task-related strategy's association with a future potential sense of well-being is embodied in the well-being (Wbπideal) and valued strategy state (Vπ(s)) terms. The valued planning state (Vπ(s)) is therefore the means and the venue for attaining and anticipating positive well-being.

Available strategic alternatives during problem-solving are represented by the anticipated or future valued strategy state (∑i=1 Vπ(st+1)). The preferred strategic planning state, Vπ(s), evolves from the sum of all strategic alternatives coupled with all anticipated or available future discounted returns.

Vπ(s) = ∑i=1 Vπ(st+1) + γ ∑i=1 rt+1

The challenge for each task is to identify a strategy, and its state, that satisfies the completion of the task (Vπ(s)). Sutton and Barto (1998) suggested that the valued planning state (Vπ(s)) can also be modeled such that learning takes place iteratively over many episodes and trials (indexed i = 1, 2, …), and the discounting of reinforcement (γ rt+1) (i.e. the present value of future rewards) serves to shape, through both incentive and projected feedback capabilities, the future generation and selection of an agent's actions (a) and states (s).
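As a rough illustration of that iterative shaping (not Sutton and Barto's own algorithm), the sketch below nudges a value estimate toward each episode's discounted return; the learning-rate parameter alpha and the sample returns are assumptions introduced here.

def update_value(v_s, episode_return, alpha=0.1):
    """Move the current estimate toward the return observed on this episode."""
    return v_s + alpha * (episode_return - v_s)

v_s = 0.0
episode_returns = [2.7, 1.9, 2.4, 2.6]   # hypothetical discounted returns, trials i = 1..4
for g in episode_returns:
    v_s = update_value(v_s, g)
print(round(v_s, 3))                     # estimate shaped by repeated reinforcement trials

Each trial's return acts both as incentive (it is what the agent anticipates) and as feedback (it moves the estimate that will guide the next selection).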

The person or agent develops an internal model, or state (s), of cognitive responses (π) to environmental conditions (a state plan or strategy) at any point in state time (st). He or she mentally examines the relationships between the variables and manipulates these conditions during the course of problem-solving and strategy development in order to reach the desired state-plan strategy that will satisfy task requirements and yield the return of well-being. This internal model is a culmination of all prior, current, and future equivalent learning experiences.

Wbπideal = Vπ(s)

And Vπ(s) = Vπ(sideal)

Vπ(s) = ∑i=1 Vπ(st-n, at-n) + Vπ(st, at) + Vπ(st+1).

Where the valued strategy state, Vπ(s), will be the selected strategy state and will be considered the ideal strategy for realizing future return. The current model is also based on previously learned concepts and states that were, are, and will be associated with strategic planning.

An Internal Model

People typically compare their internal model (and conceptualization) of the cumulatively remembered responses of themselves and others with the outside world (or environmental condition) through pattern-matching procedures (Stein & Levine, 1990). This internal model of one's world embodies each person's learning about relationships among stimuli and others, and the positive feedback capability of previously delivered returns. This internal model provides a context for developing future expectations of events and occurrences, Vπ(st+1).

When an expectation for pleasurable, desired, and valued states (of positive well-being) has been attained (goal attainment or success), a match between that which had been expected and what had occurred is experienced. The strategy underlying this successful match then becomes incorporated and integrated into the internal model's cumulative strategy state, modifies the prior state, ∑i=1 Vπ(st-n, at-n), and morphs into a new, modified, integrative state, ∑i=1 Vπ(s, a), which is distinctly different from the previous and original state.

Where: ∑i=1 Vπ(st-n, at-n) + Vπ(s) → ∑i=1 Vπ(s, a)
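A minimal sketch of this integration step, with a hypothetical internal model kept as a table keyed by (state, action); the blending weight and the entries are assumptions for illustration, not part of the model above.

internal_model = {("task_state", "task_action"): 0.6}   # prior cumulative Vπ(s_{t-n}, a_{t-n})

def integrate(model, state, action, v_pi_confirmed, weight=0.5):
    """Fold a newly confirmed strategy value into the cumulative entry for (state, action)."""
    prior = model.get((state, action), 0.0)
    model[(state, action)] = (1 - weight) * prior + weight * v_pi_confirmed
    return model

integrate(internal_model, "task_state", "task_action", 1.0)
print(internal_model)   # the modified, integrative state, distinct from the prior one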

This new internal model is drawn upon in the course of future learning. The valued state and the valued strategy state are two of many states that are part of the internal model. Where:

V(s) ∈ Iminner and Vπ(s) ∈ Iminner.

In response to an environment's outcome responses and events, people evaluate an occurrence's meaningfulness, relevance, and congruence or disparity with the belief structure of their internal model; identify the discrepancy (if any) between what was expected to occur and what had occurred (match-mismatch); and assess (and conceptualize in an appraisal) the relevance of the outcome for goal status, the need for maintaining or modifying goals, and the stability and certainty of goal success or failure (Stein, Trabasso, & Liwag, 1993). This cognitive activity is conceptualized and later reaches some form of consciousness when (primary and secondary) appraisals are subsequently generated.

Component comparisons between the internal model and the environmental condition are important for assessing the accuracy of the one against the other. Monitoring the disparity between the two helps determine the nature of later responses (e.g. emotion, cognition, and behavior). Therefore, the algorithm's valued state of the environment, V(st), is useful for monitoring the strategy state at time t that is inherent in the current real-time conditions, Eπc, and in the current return, rt.

Where: V(st) = Eπc + rt.

The valued strategy state, Vπ(s), also has the capacity for obtaining future well-being, as the organism seeks to match the valued strategy state, Vπ(s), to the required environmental strategy, Vπ(st), in order to obtain the reward or return (rt+1).

The value of the strategy state is determined by the net sum of all prior states and actions, plus the sum of all discounted future returns (with both incentive and projected feedback valuations), plus the anticipated policy state less the environmental outcome response, which is also the valued strategy state at time t. The difference Vπ(st+1) - V(st) is the selected policy state, whose result will later be added to ∑i=1 Vπ(st-n, at-n) and will be reflected in the immediate learning state of the internal working model, Vπ(s).

Vπ(s) = ∑i=1 Vπ(st-n, at-n) + γ ∑i=1 rt+1 + (Vπ(st+1) - V(st)).

When the expectation for the pleasurable, desired, and valued state (of positive well-being) has been attained (goal attainment or success), a match between that which had been expected and what had occurred is experienced. This match can be weighted by an estimate of the prediction's accuracy and stability, as follows.

Where: Vπ(s) = ∑i=1 Vπ(st-n, at-n) + γ ∑i=1 rt+1 + (β - 1)(Vπ(st+1, at+1) - V(st)).

And where (β - 1) is an estimate of the prediction's accuracy and stability, and (Vπ(st+1, at+1) - V(st)) is the temporal difference.

The temporal difference can be isolated from the rest of the formula; the result, (ΤDIFF), the match or mismatch of the expectation, can then be monitored and measured accordingly.

ΤDIFF = (β - 1)(Vπ(st+1, at+1) - V(st))

The future return, γ ∑i=1 rt+1, is realized, as rt, when the temporal difference is zero, ΤDIFF = 0, and when Vπ(st+1, at+1) - V(st) = 0. This conceptualization reflects that what was expected had also been delivered. The match not only allows for reward delivery, but also supplies feedback on the appropriateness and correctness of the selected and implemented strategy. Because the product is a function of a match, it is represented as f(0) = (0, 0), suggesting that there is no error and no other responses to the match.

The future return is not realized when ΤDIFF > 0, or when Vπ(st+1, at+1) - V(st) > 0. This conceptualization reflects the existence of a mismatch between that which had been predicted and that which had occurred. In this situation the algorithm's product is more than zero, because the product is represented as a real number, ℜ. Because the product is a function of a mismatch, it is represented as f(1) = (0, 1), suggesting that there is a disparity and that other responses to the mismatch follow.

To summarize, the prediction's accuracy and stability, β - 1, are confirmed with a match. A match reflects that the difference between the expected outcome and the delivered reward results in a product of 0. When the prediction's accuracy and stability, β - 1, are in question, the result will later be confirmed with a mismatch. A mismatch reflects that the difference between the expected and delivered outcome results in a product greater than 0.
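The match/mismatch monitoring can be sketched as follows; beta_minus_1, the tolerance, and the example values are placeholders introduced here rather than quantities specified by the model.

def t_diff(v_pi_next, v_current, beta_minus_1=1.0):
    """Temporal-difference term weighted by the prediction's accuracy/stability estimate."""
    return beta_minus_1 * (v_pi_next - v_current)

def classify(tdiff, tol=1e-9):
    """A TDIFF of 0 signals a match; a TDIFF above 0 signals a mismatch."""
    return ("match: return realized, strategy confirmed" if abs(tdiff) <= tol
            else "mismatch: return withheld, strategy needs revision")

print(classify(t_diff(v_pi_next=1.0, v_current=1.0)))   # match
print(classify(t_diff(v_pi_next=1.0, v_current=0.4)))   # mismatch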

The emotional learning experience can be summarized as follows. The valued state at any point in time is the sum of all valued prior strategy states and actions in the internal model, all discounted anticipated future returns (rt+1) associated with selecting the ideal strategy state, the prediction's stability (β - 1), and the selected ideal strategy state less the actual value of the strategy state. The latter portion of the model is a temporal difference component, whose response state is embodied in the outcome.

The experience and maintenance of the valued state imputed in the match between expectation and outcome typically results in the later expression of the emotion of happiness. Ultimately, people (and animals) are motivated to develop goals and have an incentive for experiencing well-being and the emotion of happiness or joy (McClelland, 1985). People develop goals for acquiring tangible and rewarding objects (e.g. tasty foods, desired melodic music, visually appealing items, etc.) or intangible rewards generated during interpersonal social interactions (e.g. a smile from another, a verbal acknowledgment validating one's value, a hug confirming love and acceptance, etc.).

Equation Application

Let's apply the algorithm to a real-life situation and interpersonal interaction between two individuals. In order to earn a sense of well-being and parental soothing (V(s)) from a much-needed parental hug, Ec or rt+1, a toddler is told that he is required to put his toys in a toy chest (at+1). The young child will develop a valued strategy state (Vπ(s)) to satisfy this parental request, based on the sum of all similar prior behavioral strategies and actions (∑i=1 Vπ(st-n, at-n)). The reward of a parental hug, Ec, or perceived reward (γ ∑i=1 rt+1) and rewarded state (V(st+1)) will be granted by Ec with the task completion (at+1) of putting the toys away. The algorithm would be reflected accordingly.

Vπ(s) = ∑i=1 Vπ(st-n, at-n) + γ ∑i=1 rt+1 + (Vπ(st+1, at+1) - V(st, at))

Where the toddler's valued strategy state for complying with the request to pick up the toys, Vπ(s), is based on total prior learning, ∑i=1 Vπ(st-n, at-n), the perceived need and anticipated valued state for maternal nurturance and a hug, γ ∑i=1 rt+1, and the selected strategy for picking up the toys, Vπ(st+1, at+1), less the actual strategy required by the environment (i.e. the mother), V(st, at). Once the toddler picks up all the toys and receives feedback (the hug, or rt) that the action and the strategy underlying it were correct, the Vπ(s) integrates with the ∑i=1 Vπ(st-n, at-n). This integration reflects that learning has taken place.
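A numerical walk-through of the toddler example, with invented values chosen only to show the arithmetic; none of these numbers come from the model itself.

prior_learning  = 0.5           # ∑ Vπ(s_{t-n}, a_{t-n}): prior tidy-up strategies
gamma, hug      = 0.9, 1.0
anticipated_hug = gamma * hug   # γ r_{t+1}: the discounted parental hug
selected_plan   = 0.8           # Vπ(s_{t+1}, a_{t+1}): "put the toys in the chest"
required_plan   = 0.8           # V(s_t, a_t): the strategy the mother actually requires

v_pi_s = prior_learning + anticipated_hug + (selected_plan - required_plan)
print(round(v_pi_s, 2))         # 1.4

Because the selected plan equals the required plan, the temporal-difference term vanishes, the hug (rt) is delivered, and the confirmed strategy is folded back into the toddler's internal model.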

The Reward of Avoiding Fear

However, people are also motivated to avoid pain and the experience of the emotion of fear (Freud, 1922, p. 38; Salichs & Malfaz, 2006) or anxiety (McClelland, 1985). The removal of something negative is perceived as rewarding, insofar as its removal fosters a relief-driven sense of positive well-being. This is known as negative reinforcement (Rachlin, 1976; Baron, 1991). Motivational goals are therefore developed for optimizing each individual's well-being by facilitating the avoidance of (potentially) painful experiences and of acute fear/anxiety.

Like Sutton & Barto's (1998) valuation of state and action in their reinforcement model, V(s, a), Salichs & Malfaz (2006) used Q-values of state and action in their Q-learning model to conceptualize how fear and harm avoidance later elicited and optimized well-being. Fear was presumed to play a preparatory role in averting subsequent negative alterations in well-being. Their model also presumed prior experience with a fear-arousing object (Qobjworst), as the agent was able to anticipate a worst-case scenario based on some prior experience or knowledge. According to Salichs & Malfaz (2006) the fear-sequence update proceeded as follows: Qobjworst(s, a) = min(Qobjworst(s, a), r + γ maxa∈Aobj Qobj(s', a)).

Where a was one of many actions (A) (a ∈ A) that could facilitate the return to well-being, s' was the new state that occurred in response to having taken action a, r was the return or reinforcement (a valuation capable of eliciting later well-being), and γ was a discount factor. According to Salichs & Malfaz (2006) the emotion of fear (Qobjfear(s, a)) helped to mobilize an agent into selecting actions that could reduce an environment's or object's harm, Qobjworst. A fear-reducing action was embodied accordingly as af = arg max Qobjfear(s, a), where the selected action (toward a fear-inducing object) is a function of the maximal amount of fear that is expressed in a state and action.

Therefore, Qobjfear(s, a) = β Qobj(s, a) + (1 - β) Qobjworst(s, a), where the expression of the state and action of fear toward an object can best be understood as emanating from a response to an actual fear-inducing object less the potential worst-case scenario in interaction with that fear-inducing object. The difference would yield the actual fear-related strategy selected, much in the way that the term (ε Vπ(st+1, at+1) - V(st, at)) is the selected strategy for achieving the valued strategy state for well-being.
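A loose Python sketch of these fear-related updates as they are described above (not the authors' implementation); the states, actions, rewards, and parameter values are invented for illustration.

gamma, beta = 0.9, 0.7
states, actions = ["near_object", "safe"], ["approach", "retreat"]

Q       = {(s, a): 0.0 for s in states for a in actions}   # ordinary Q-values
Q_worst = {(s, a): 0.0 for s in states for a in actions}   # remembered worst case

def update_worst(s, a, r, s_next):
    """Qobjworst(s,a) = min(Qobjworst(s,a), r + gamma * max_a' Qobj(s_next, a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q_worst[(s, a)] = min(Q_worst[(s, a)], r + gamma * best_next)

def q_fear(s, a):
    """Qobjfear(s,a) = beta * Qobj(s,a) + (1 - beta) * Qobjworst(s,a)."""
    return beta * Q[(s, a)] + (1 - beta) * Q_worst[(s, a)]

# A painful outcome (r = -1) drags the worst-case memory for that choice downward,
# which lowers its fear-weighted value and biases later action selection away from it.
update_worst("near_object", "approach", r=-1.0, s_next="near_object")
print(q_fear("near_object", "approach"), q_fear("near_object", "retreat"))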

The Q-learning model can be converted to a valued-state, V(s), and object-harm nomenclature. This can be described as follows.

Vπ(s, a)fear = ∑i=1 Vπ(st-n, at-n)fear + γ ∑ rt+1 + (ε Vπ-harm(st+1, at+1)worst − V(s)object)

Where the optimal and valued strategy, and strategy-based state and action, for increasing well-being in response to potential fear is the sum of all prior effective strategies, the discounted future returns (with attaining anticipated well-being), and the strategy that is linked with the occurrence of the worst-case scenario less the present response of the fear-inducing object.

A person is prepared for a later confrontation with a fearful stimulus or object when he or she has developed a strategy or action plan to deal with the worst-case scenario. The more effective the strategy and action plan, the more likely the person will realize reward-related future well-being. This is embodied in the following equation: (β - 1)(Vπ(st+1, at+1) - Vobject) ≤ 0. Where the mastery over a fear-producing object will result in achieving a return or sense of well-being, γ ∑i=1 rt+1, and a product that is equal to or smaller than 0. The person who is prepared for an adverse experience is most likely able to manage or escape from that adverse experience and realize the future return of well-being, γ ∑i=1 rt+1.

But if the person is ill-prepared for an adverse experience (owing to an ineffective strategy and action plan) and the fear-producing object overwhelms and conquers, the fear-producing experience will likely continue unabated. This is embodied in the following equation: (β - 1)(Vπworst(st+1, at+1) - Vobject) > 0. Where the inability to control (through strategy and action) the fear-producing situation will likely result in fear persistence, Vfear, a product greater than 0, and the need for a strategy modification, Vπfear.
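The two conditions can be monitored with a small check like the one below; the function, its arguments, and the sample values are hypothetical placeholders, not quantities defined by the model.

def fear_outcome(v_pi_worst_plan, v_object, beta_minus_1=1.0):
    """A result at or below 0 marks preparedness (the return follows);
    a result above 0 marks an overwhelmed plan and persisting fear."""
    signal = beta_minus_1 * (v_pi_worst_plan - v_object)
    return ("prepared: well-being return realized" if signal <= 0
            else "overwhelmed: fear persists, strategy needs modification")

print(fear_outcome(v_pi_worst_plan=0.7, v_object=0.9))   # prepared
print(fear_outcome(v_pi_worst_plan=0.9, v_object=0.4))   # overwhelmed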

References

Baron, A. (1991). Avoidance & punishment. In: I.H. Iverson & K.A. Lattal (Eds.) Techniques in the behavioral and neural sciences: Experimental analysis of behavior, part 1 (pp. 173-217). New York, New York: Elsevier.

Freud, S. (1922). Beyond the pleasure principle. In: C.J.M. Hubback (Ed.) The International Psycho-Analytical Library, No. 4. London, England: The International Psycho-Analytic Press.

Freud, S. (1989). Beyond the pleasure principle.  In: P. Gay (Ed.) The Freud reader.  New York: W.W. Norton & Company.

Keyes, C.L., Shmotkin, D., & Ryff, C.D. (2002). Optimizing well-being: The empirical encounter of two traditions. Journal of Personality & Social Psychology, 82(6), 1007-1022.

McClelland, D.C. (1985). Human motivation.  Glenview, Ill.: Scott, Foresman, & Co. Chapters 4 & 10.

Rachlin, H. (1976). Behavior & learning. California: W.H. Freeman & Co.

Salichs, M.A. & Malfaz, M. (April, 2006). Using emotions on autonomous agents: the role of happiness, sadness, and fear.  Integrative approaches to machine consciousness, part of AISB 2006: Adaptation in Artificial and Biological Systems.  Bristol, England.

Stein, N.L., & Levine, L.J. (1990). Making sense out of emotion: The representation and use of goal-structured knowledge. In: N.L. Stein, B. Leventhal, & T. Trabasso (Eds.) Psychological and biological approaches to emotion (pp. 45-73). Hillsdale, New Jersey: Lawrence Erlbaum Pub.

Sutton, R.S. & Barto, A.G. (1998). Reinforcement learning: an introduction.  Cambridge, Mass: M.I.T. Press.