Learning Behavior Selection by Emotions and Cognition in a Multi-Goal Robot Task
Sandra Clara Gadanho
Presented by Jamie Levy
Purpose
- Build an autonomous robot controller that can learn to master a complex task when situated in a realistic environment:
  - Continuous time and space
  - Noisy sensors
  - Unreliable actuators
Possible problems for the learning algorithm
- Multiple goals may conflict with each other.
- Situations arise in which the agent needs to temporarily overlook one goal to accomplish another.
- There are both short-term and long-term goals.
Possible problems for the learning algorithm (cont.)
- A sequence of different behaviors may be needed to accomplish one goal.
- Behaviors are unreliable.
- A behavior's appropriate duration is not fixed in advance; it depends on the environment and on the behavior's success.
Emotion-based (EB) Architecture
- A traditional RL adaptive system is complemented with an emotion system responsible for behavior switching.
- Innate emotions define the goals.
- The agent learns emotion associations for environment-state and behavior pairs to determine its decisions.
- Q-learning is used to learn the behavior-selection policy, which is stored in neural networks.
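
As a rough illustration of the adaptive system, the sketch below shows a Q-learning update over a small set of behaviors. It substitutes a linear approximator for the paper's feed-forward networks, and the names and constants (n_features, alpha, gamma) are assumptions for the example only.

```python
# Minimal sketch of the adaptive system's learning rule. A linear Q-function
# per behavior stands in for the paper's one-network-per-behavior setup;
# alpha, gamma and n_features are illustrative values, not taken from the paper.
import numpy as np

n_features = 6            # homeostatic variables + perceptual inputs, scaled to [0, 1]
behaviors = ["avoid_obstacles", "seek_light", "follow_wall"]
alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed)

weights = {b: np.zeros(n_features) for b in behaviors}

def q_value(state, behavior):
    """Approximate Q(state, behavior); state is an np.ndarray of n_features."""
    return float(weights[behavior] @ state)

def q_update(state, behavior, reinforcement, next_state):
    """Standard Q-learning target: r + gamma * max_b' Q(next_state, b')."""
    target = reinforcement + gamma * max(q_value(next_state, b) for b in behaviors)
    error = target - q_value(state, behavior)
    weights[behavior] += alpha * error * state   # gradient step for the linear model
```
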
ALEC – Asynchronous Learning by Emotion and Cognition
- Augments the EB architecture with a cognitive system,
  - which holds explicit rule knowledge extracted from interactions with the environment.
- Based on the CLARION model by Sun and Peterson (1998),
  - which allows the decision rules to be learned in a bottom-up fashion.
ALEC architecture (cont.)
- ALEC I: the cognitive system is directly inspired by the top level of the CLARION model.
- ALEC II: has some changes (discussed later).
- ALEC III: the emotion system learns about goal states exclusively, while the cognitive system learns about goal-state transitions.
- LEC (Learning by Emotion and Cognition): the non-asynchronous variant, used to test the usefulness of behavior switching.
EB II
- Replaces the emotional model with a goal system.
- The goal system is based on a set of homeostatic variables that the agent attempts to maintain within certain bounds.
EB II Architecture
- Composed of two parts:
  - Goal system
  - Adaptive system
Perceptual Values
- Light intensity
- Obstacle density
- Energy availability
  - Indicates whether a nearby source is releasing energy
Behavior System
- Three hand-designed behaviors to select from:
  - Avoid obstacles
  - Seek light
  - Wall following
- The behaviors are not designed to be very reliable and may fail.
  - E.g., wall following may lead to a crash.
Goal System
- Responsible for deciding when behavior switching should occur.
- Goals are explicitly identified and associated with homeostatic variables.
Each homeostatic variable can be in one of three states:
- Target
- Recovery
- Danger
Homeostatic Variables
- A variable remains in its target state as long as its value is optimal or acceptable.
- A well-being variable is derived from the homeostatic variables.
- Each variable has an effect on well-being.
Homeostatic Variables
- Energy
  - Reflects the goal of maintaining the agent's energy.
- Welfare
  - Maintains the goal of avoiding collisions.
- Activity
  - Ensures the agent keeps moving; otherwise its value slowly decreases and the target state is not maintained.
Well-Being
- State change – when a homeostatic variable changes from one state to another, well-being is influenced (positively for changes toward the target state).
- Predictions of state change – when some perceptual cue predicts a state change of a homeostatic variable, the influence is similar to the above, but lower in value.
- These influences are modeled after emotions and may describe "pain" or "pleasure".
Well-Being (cont.)
- cs = state coefficient
- rs = influence of the state on well-being
Well-Being (cont.)
- ct(sh) = state transition coefficient
- wh = weight of homeostatic variable h
  - 1.0 for energy
  - 0.6 for welfare
  - 0.4 for activity
Well-Being (cont.)
- cp = prediction coefficient
- rph = value of the prediction
  - Only considered for the energy and activity variables
Well-being calculation
- (equation slides; the well-being formulas are not reproduced in this transcript)
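
The well-being equations themselves are not preserved in this transcript. From the coefficient definitions on the preceding slides, the well-being presumably sums, over the homeostatic variables, a weighted state term, a state-transition term and a prediction term. A hedged LaTeX sketch of that combination follows; the paper's exact expression may differ.

```latex
% Hedged reconstruction of the well-being value from the coefficients defined
% above; the slide's exact formula is not preserved in this transcript.
\[
  Wb \;=\; \sum_{h} w_h \left( c_s\, r_{s_h} \;+\; c_t(s_h) \;+\; c_p\, r_{p_h} \right)
\]
% w_h        : weight of homeostatic variable h (1.0 energy, 0.6 welfare, 0.4 activity)
% c_s r_{s_h}: contribution of the variable's current state
% c_t(s_h)   : contribution of a state transition (zero when no transition occurs)
% c_p r_{p_h}: contribution of a predicted state change (energy and activity only)
```
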
Well-being calculation - Prediction
- The values of rph depend on the strengths of the current predictions and vary between -1 (for predictions of no desirable change) and 1.
- If there is no prediction, rph = 0.
Well-being calculation - Prediction (cont.)
- (equation slide; the prediction formula is not reproduced in this transcript)
Well-being calculation - Prediction (cont.)
- The activity prediction provides a no-progress indicator, given at regular time intervals when the activity of the robot has been low for a long period of time.
  - rp(activity) = -1
- There is no prediction for welfare.
  - rp(welfare) = 0
Adaptive System
- Uses Q-learning.
- The state information fed to the neural networks consists of the homeostatic variable values and other perceptual values gathered from the sensors.
Adaptive System (cont.)
- The controller tries to maximize the reinforcement received by selecting among the available hand-designed behaviors.
Adaptive System (cont.)
- The agent may select either the behavior that has proven better in the past or an arbitrary one.
- The selection function is based on the Boltzmann-Gibbs distribution (p. 30 in the class textbook).
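
For reference, the standard Boltzmann-Gibbs (softmax) selection rule over Q-values has the form below, where T is a temperature parameter controlling exploration; the temperature schedule used in the paper is not given on the slides.

```latex
% Standard Boltzmann-Gibbs (softmax) behavior selection; T is a temperature
% parameter (notation assumed, not taken from the slides).
\[
  P(b \mid x) \;=\; \frac{e^{\,Q(x,b)/T}}{\sum_{b'} e^{\,Q(x,b')/T}}
\]
```
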
EB II Architecture
- (architecture diagram; not reproduced in this transcript)
ALEC Architecture
- (architecture diagram; not reproduced in this transcript)
ALEC I
- Inspired by the CLARION model.
- Each individual rule consists of an activation condition and a behavior suggestion.
- The activation condition is dictated by a set of intervals, one for each dimension of the input space.
- There are 6 input dimensions, each varying between 0 and 1, with interval boundaries at multiples of 0.2.
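
A minimal sketch of how such a rule might be represented is given below; the class and field names and the success-rate bookkeeping are illustrative only, and the example condition corresponds to the "avoid obstacles" rule shown on a later slide.

```python
# Illustrative representation of a cognitive-system rule: one [low, high]
# interval per input dimension plus a behavior suggestion. Names are assumptions.
from dataclasses import dataclass

@dataclass
class Rule:
    condition: list      # one (low, high) pair per input dimension, bounds on the 0.2 grid
    behavior: str        # suggested behavior when the condition matches
    successes: int = 0
    applications: int = 0

    def matches(self, state):
        """A rule applies when every input value falls inside its interval."""
        return all(lo <= x <= hi for (lo, hi), x in zip(self.condition, state))

    def success_rate(self):
        return self.successes / self.applications if self.applications else 0.0

# Example rule (see the "Example of a Rule" slide): execute "avoid obstacles"
# when energy is high, welfare is low and obstacle density is high.
rule = Rule(condition=[(0.6, 1), (0, 1), (0, 0.6), (0, 1), (0.8, 1), (0, 1)],
            behavior="avoid_obstacles")
```
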
ALEC I (cont.)
- A condition interval may only start or end at the pre-defined points of the input space.
- Since this may lead to a large number of possible states, rule learning is limited to the few cases with successful behavior selections.
- Other cases are left to the emotion system, which uses its generalization abilities to cover the state space.
ALEC I (cont.)
- When a behavior is successful in a given state, a rule corresponding to the decision made is extracted and added to the agent's rule set.
- Whenever the same decision is made again, the agent updates the success rate (SR) of that rule.
ALEC I – Success
- r = immediate reinforcement
- The success measure also uses the difference in Q-value between the state x where decision a was made and the resulting state y.
- Tsuccess = 0.2
  - Constant threshold
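
The success equation on this slide is not preserved. Given the quantities listed above, the test is presumably of the CLARION-style form below, with gamma the Q-learning discount factor; this is a reconstruction, not the slide's verbatim formula.

```latex
% Plausible reconstruction of the success criterion from the quantities above;
% \gamma is the Q-learning discount factor. The slide's exact formula is not preserved.
\[
  r \;+\; \gamma \max_{b} Q(y, b) \;-\; Q(x, a) \;>\; T_{\mathrm{success}}
\]
```
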
ALEC I (cont.) – Rule expansion and shrinkage
- If a rule is often successful, the agent tries to generalize it to cover nearby environmental states.
- If a rule performs very poorly, the agent makes it more specific; if it still does not improve, the rule is deleted.
- There is a maximum of 100 rules.
ALEC I (cont.) – Rule expansion and shrinkage
- Statistics are kept on the success rate of every possible single-step expansion or shrinkage of the rule, in order to select the best option.
- The rule is compared against a "match-all" rule (rule_all) with the same behavior suggestion, and against itself after the best expansion or shrinkage (rule_exp, rule_shrink).
ALEC I (cont.) – Rule expansion and shrinkage
- A rule is expanded if it is significantly better than the match-all rule and the expanded rule is better than or equal to the original rule.
- A rule that is insufficiently better than the match-all rule is shrunk if that results in an improvement, or otherwise deleted.
ALEC I (cont.) – Rule expansion and shrinkage
- (the expansion/shrinkage test equations on this slide are not reproduced in this transcript)
Rule expansion and shrinkage (cont.)
- Constant thresholds:
  - Tsuccess = 0.2
  - Texpand = 2.0
  - Tshrink = 1.0
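
The precise expansion and shrinkage tests are not preserved on these slides. One plausible reading, consistent with the comparisons against rule_all, rule_exp and rule_shrink and with the thresholds above, is a ratio test on success rates; this is only a guess at the form, not the paper's confirmed criteria.

```latex
% Guessed form of the expansion/shrinkage tests; SR() denotes a rule's success
% rate. The paper's actual criteria may differ.
\begin{align*}
  \text{expand if:}\quad
    & \frac{SR(\textit{rule})}{SR(\textit{rule\_all})} > T_{\textit{expand}}
      \ \text{ and }\ SR(\textit{rule\_exp}) \ge SR(\textit{rule}) \\
  \text{shrink (or delete) if:}\quad
    & \frac{SR(\textit{rule})}{SR(\textit{rule\_all})} \le T_{\textit{shrink}}
\end{align*}
```
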
ALEC I (cont.) – Rule expansion and shrinkage
- A rule that performs badly is deleted.
- A rule is also deleted if its condition has not been met for a while.
- When two rules propose the same behavior selection and their conditions are sufficiently similar, they are merged into a single rule.
- The success rate is reset whenever a rule is modified by merging, expansion or shrinkage.
Cognitive System (cont.)
- If the cognitive system has a rule that applies to the current environmental state, it influences the behavior decision:
  - an arbitrary constant of 1.0 is added to the respective Q-value before the stochastic behavior selection is made.
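
A sketch of how this bias could enter the stochastic selection, assuming Boltzmann selection over the Q-values, is shown below; the temperature value and function names are illustrative, not taken from the paper.

```python
# Sketch of cognitive influence on behavior selection: a constant bonus is added
# to the Q-value of the behavior suggested by a matching rule, then a
# Boltzmann-Gibbs draw picks the behavior. Temperature and names are assumptions.
import math
import random

RULE_BONUS = 1.0   # constant added when a matching rule suggests a behavior

def select_behavior(q_values, suggested_behavior=None, temperature=0.1):
    """q_values: dict behavior -> Q-value; suggested_behavior: from a matching rule."""
    biased = dict(q_values)
    if suggested_behavior is not None:
        biased[suggested_behavior] += RULE_BONUS       # cognitive-system influence
    # Boltzmann-Gibbs selection over the biased values
    weights = {b: math.exp(v / temperature) for b, v in biased.items()}
    total = sum(weights.values())
    draw, acc = random.random() * total, 0.0
    for b, w in weights.items():
        acc += w
        if draw <= acc:
            return b
    return b  # numerical fallback
```
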
ALEC Architecture (revisited)
- (diagram slides; not reproduced in this transcript)
Example of a Rule – execute "avoid obstacles"
- Six input dimensions, segmented with 0.2 granularity: 0, 0.2, 0.4, 0.6, 0.8, 1
- energy = [0.6, 1]
- activity = [0, 1]
- welfare = [0, 0.6]
- light intensity = [0, 1]
- obstacle density = [0.8, 1]
- energy availability = [0, 1]
ALEC II
- Instead of the success function above, the agent considers a behavior successful if there is a positive homeostatic variable transition.
  - If a variable's state changes from the danger state to the target state.
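
Taken literally, the ALEC II success test reduces to a simple predicate over homeostatic state transitions. The sketch below encodes exactly the danger-to-target case named on the slide; the paper may count other improvements as well.

```python
# Sketch of ALEC II's success test: a behavior counts as successful when some
# homeostatic variable moves from the danger state to the target state.
# The state labels and dict layout are illustrative only.
def alec2_success(prev_states, new_states):
    """prev_states/new_states: dict variable -> 'danger' | 'recovery' | 'target'."""
    return any(prev_states[v] == "danger" and new_states[v] == "target"
               for v in prev_states)
```
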
ALEC III
- Same as ALEC II, except that the well-being does not depend on state transitions or predictions:
  - ct(sh) = 0
  - cp = 0
Experiments
- The goal of ALEC is to allow an agent faced with realistic world conditions to adapt on-line and autonomously to its environment.
- It must cope with:
  - Continuous time and space
  - Limited memory
  - Time constraints
  - Noisy sensors
  - Unreliable actuators
Khepera Robot
- Left and right wheel motors
- 8 infrared sensors that allow it to detect object proximity and ambient light
  - 6 in the front
  - 2 in the rear
Experiment (cont.)
- (figure slide; image not reproduced in this transcript)
Goals
- Maintain energy
- Avoid obstacles
- Move around in the environment
  - Not as important as the first two
Energy Acquisition
- Must overlook the goal of avoiding obstacles:
  - must bump into the source.
- Energy is available only for a short period:
  - must look for new sources.
- Energy is received when high light values are read by the rear sensors.
Procedure
- Each experiment consisted of:
  - 100 different robot trials of 3 million simulation steps each.
- In each trial, a new fully recharged robot with all state values reset was placed at a randomly selected starting position.
- For evaluation, the trial period was divided into 60 smaller periods of 50,000 steps.
Procedure (cont.)
- For each of these periods, the following were recorded:
  - Reinforcement – mean of the reinforcement (well-being) value calculated at each step.
  - Energy – mean energy level of the robot.
  - Distance – mean Euclidean distance d measured at 100-step intervals (approximately the number of steps needed to move between corners of the environment).
  - Collisions – percentage of steps involving collisions.
Results
- Pairs of controllers were compared using a randomized analysis of variance (RANOVA; Piater, 1999).
Results (cont.)
- The most important contribution to the reinforcement is the state value.
- For the successful accomplishment of the task and its goals, all homeostatic variables should be taken into consideration in the reinforcement.
- Agents with no:
  - energy-dependent reinforcement fail in their main task of maintaining energy levels;
  - welfare-dependent reinforcement show increased collisions;
  - activity-dependent reinforcement move only as a last resort (to avoid collisions).
Results (cont.)
- Predictions of state transitions proved essential for the agent to accomplish its tasks.
  - A controller with no energy prediction is unable to acquire energy.
  - A controller with no activity prediction will eventually stop moving.
Results – EB, EB II and Random
- The first set of graphs compares three different agents:
  - EB – discussed in the earlier paper
  - EB II
  - Random – selects randomly among the available behaviors at regular intervals
Results – EB, EB II and Random
- (a series of results-graph slides; the plots are not reproduced in this transcript)
Conclusion
- The emotion and cognitive systems can improve learning, but they are unable to store and consult every single event the agent experiences.
- The emotion system gives a "sense" of what is right, while the cognitive system constructs a model of reality and corrects the emotion system when it reaches incorrect conclusions.
Future work
- Adding more specific knowledge to the cognitive system, which may then be used for planning more complex tasks.