Bayesian Models of Human Learning and Inference Josh Tenenbaum MIT Department of Brain and Cognitive Sciences
Shiffrin Says: "Progress in science is driven by new tools, not great insights."
Outline • Part I. Brief survey of Bayesian modeling in cognitive science. • Part II. Bayesian models of everyday inductive leaps.
Collaborators Tom Griffiths Charles Kemp Tevye Krynski Sourabh Niyogi Neville Sanjana Mark Steyvers Sean Stromsten Fei Xu Wheeler Ruml Dave Sobel Alison Gopnik
Outline • Part I. Brief survey of Bayesian modeling in cognitive science. – Rational benchmark for descriptive models of probability judgment. – Rational analysis of cognition – Rational tools for fitting cognitive models
Normative benchmark for descriptive models • How does human probability judgment compare to the Bayesian ideal? – Peterson & Beach, Edwards, Tversky & Kahneman, . . • Explicit probability judgment tasks – Drawing balls from an urn, rolling dice, medical diagnosis, . . • Alternative descriptive models – Heuristics and Biases, Support Theory, . .
Rational analysis of cognition • Develop Bayesian models for core aspects of cognition not traditionally thought of in terms of statistical inference. • Examples: – Memory retrieval: Anderson; Shiffrin et al, . . – Reasoning with rules: Oaksford & Chater, . .
Rational analysis of cognition • Often can explain a wider range of phenomena than previous models, with fewer free parameters. [Figures: power laws of practice and retention; spacing effects on retention]
Rational analysis of cognition • Often can explain a wider range of phenomena than previous models, with fewer free parameters. • Anderson’s rational analysis of memory: – For each item in memory, estimate the probability that it will be useful in the present context. – Model of need probability inspired by library book access. Corresponds to statistics of natural information sources:
Rational analysis of cognition [Figure: log need odds vs. log days since last occurrence, for short and long lags — memory retrieval matches the statistics of natural information sources]
Rational analysis of cognition • Often can show that apparently irrational behavior is actually rational. Which cards do you have to turn over to test this rule: "If there is an A on one side, then there is a 2 on the other side"?
Rational analysis of cognition • Often can show that apparently irrational behavior is actually rational. • Oaksford & Chater’s rational analysis: – Optimal data selection based on maximizing expected information gain. – Test the rule “If p, then q” against the null hypothesis that p and q are independent. – Assuming p and q are rare predicts people’s choices:
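Oaksford & Chater's analysis can be sketched in a few lines. The sketch below assumes illustrative parameter values (P(p) = 0.1, P(q) = 0.2) and a deterministic rule model; the original analysis is more elaborate, but the rarity-driven ordering of card informativeness comes out the same way.

```python
import math

# Two hypotheses, each with prior 0.5:
#   M_dep: "if p then q" holds, i.e. P(q|p) = 1
#   M_ind: p and q are statistically independent
P_p, P_q = 0.1, 0.2   # rarity assumption: p and q are rare

def entropy(probs):
    return -sum(x * math.log2(x) for x in probs if x > 0)

# P(informative hidden face | visible card, model).
# "Outcome = 1" means q on the back for the p / not-p cards,
# and p on the back for the q / not-q cards.
likelihoods = {
    'p':     (1.0,                     P_q),
    'not-p': ((P_q - P_p) / (1 - P_p), P_q),
    'q':     (P_p / P_q,               P_p),
    'not-q': (0.0,                     P_p),
}

def expected_information_gain(card):
    l_dep, l_ind = likelihoods[card]
    eig = entropy((0.5, 0.5))   # prior uncertainty: 1 bit
    for o_dep, o_ind in [(l_dep, l_ind), (1 - l_dep, 1 - l_ind)]:
        p_outcome = 0.5 * o_dep + 0.5 * o_ind
        if p_outcome > 0:
            posterior = (0.5 * o_dep / p_outcome, 0.5 * o_ind / p_outcome)
            eig -= p_outcome * entropy(posterior)
    return eig

eig = {card: expected_information_gain(card) for card in likelihoods}
# With rare p and q, informativeness is ordered p > q > not-q > not-p,
# matching the observed ordering of card-selection frequencies.
```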
Rational tools for fitting cognitive models • Use Bayesian Occam’s Razor to solve the problem of model selection: trade off fit to the data with model complexity. • Examples: – Comparing alternative cognitive models: Myung, Pitt, . . – Fitting nested families of models of mental representation: Lee, Navarro, . .
Rational tools for fitting cognitive models • Comparing alternative cognitive models via an MDL approximation to the Bayesian Occam’s Razor takes into account the functional form of a model as well as the number of free parameters.
Rational tools for fitting cognitive models • Fit models of mental representation to similarity data, e. g. additive clustering, additive trees, common and distinctive feature models. • Want to choose the complexity of the model (number of features, depth of tree) in a principled way, and search efficiently through the space of nested models. Using Bayesian Occam’s Razor:
Outline • Part I. Brief survey of Bayesian modeling in cognitive science. • Part II. Bayesian models of everyday inductive leaps: rational models of cognition in which Bayesian model selection and the Bayesian Occam's Razor play a central explanatory role.
Everyday inductive leaps How can we learn so much about. . . – Properties of natural kinds – Meanings of words – Future outcomes of a dynamic process – Hidden causal properties of an object – Causes of a person’s action (beliefs, goals) – Causal laws governing a domain . . . from such limited data?
Learning concepts and words
Learning concepts and words “tufa” Can you pick out the tufas?
Inductive reasoning Input: Cows can get Hick’s disease. Gorillas can get Hick’s disease. (premises) All mammals can get Hick’s disease. (conclusion) Task: Judge how likely conclusion is to be true, given that premises are true.
Inferring causal relations
Input:
                    Day 1   Day 2   Day 3   Day 4  . . .
Took vitamin B23:    yes     no      yes     no    . . .
Headache:            no      yes     no      yes   . . .
Does vitamin B23 cause headaches?
Task: Judge probability of a causal link given several joint observations.
The Challenge • How do we generalize successfully from very limited data? – Just one or a few examples – Often only positive examples • Philosophy: – Induction is a “problem”, a “riddle”, a “paradox”, a “scandal”, or a “myth”. • Machine learning and statistics: – Focus on generalization from many examples, both positive and negative.
Rational statistical inference (Bayes, Laplace):

p(h|d) = p(d|h) p(h) / Σ_{h'∈H} p(d|h') p(h')

(posterior probability ∝ likelihood × prior probability)
History of Bayesian Approaches to Human Inductive Learning • Hunt • Suppes – “Observable changes of hypotheses under positive reinforcement”, Science (1965), w/ M. Schlag-Rey. “A tentative interpretation is that, when the set of hypotheses is large, the subject ‘samples’ or attends to several hypotheses simultaneously. . It is also conceivable that a subject might sample spontaneously, at any time, or under stimulations other than those planned by the experimenter. A more detailed exploration of these ideas, including a test of Bayesian approaches to information processing, is now being made. ”
History of Bayesian Approaches to Human Inductive Learning • Hunt • Suppes • Shepard – Analysis of one-shot stimulus generalization, to explain the universal exponential law. • Anderson – Rational analysis of categorization.
Theory-Based Bayesian Models • Explain the success of everyday inductive leaps based on rational statistical inference mechanisms constrained by domain theories well-matched to the structure of the world. • Rational statistical inference (Bayes): • Domain theories generate the necessary ingredients: hypothesis space H, priors p(h).
Questions about theories • What is a theory? – Working definition: an ontology and a system of abstract (causal) principles that generates a hypothesis space of candidate world structures (e.g., Newton's laws). • How is a theory used to learn about the structure of the world? • How is a theory acquired? – Probabilistic generative models; statistical learning.
Alternative approaches to inductive generalization • Associative learning • Connectionist networks • Similarity to examples • Toolkit of simple heuristics • Constraint satisfaction
Marr’s Three Levels of Analysis • Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out? ” • Representation and algorithm: Cognitive psychology • Implementation: Neurobiology
Descriptive Goals • Principled mathematical models, with a minimum of arbitrary assumptions. • Close quantitative fits to behavioral data. • Unified models of cognition across domains.
Explanatory Goals • How do we reliably acquire knowledge about the structure of the world, from such limited experience? • Which processing models work, and why? • New views on classic questions in cognitive science: – Symbols (rules, logic, hierarchies, relations) versus Statistics. – Theory-based inference versus Similarity-based inference. – Domain-specific knowledge versus Domain-general mechanisms. • Provides a route to studying people’s hidden (implicit or unconscious) knowledge about the world.
The plan • Basic causal learning • Inferring number concepts • Reasoning with biological properties • Acquisition of domain theories – Intuitive biology: Taxonomic structure – Intuitive physics: Causal law
Learning a single causal relation
Given a random sample of mice:
                     Injected with X   Not injected with X
Expressed Y                45                  30
Did not express Y          15                  30
• "To what extent does chemical X cause gene Y to be expressed?"
• Or, "What is the probability that X causes Y?"
Associative models of causal strength judgment
                          e+ (expressed Y)   e- (did not express Y)
c+ (injected with X)             a                    b
c- (not injected with X)         c                    d
• Delta-P (or asymptotic Rescorla-Wagner):
  ΔP = P(e+|c+) − P(e+|c−) = a/(a+b) − c/(c+d)
• Power PC (Cheng, 1997):
  power = ΔP / (1 − P(e+|c−))
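A minimal sketch of the two strength measures, applied to the mouse data above (45 of 60 injected mice expressed Y; 30 of 60 uninjected mice did):

```python
def delta_p(a, b, c, d):
    """Delta-P = P(e+|c+) - P(e+|c-) from a 2x2 contingency table."""
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    """Cheng's (1997) generative causal power: Delta-P / (1 - P(e+|c-))."""
    return delta_p(a, b, c, d) / (1 - c / (c + d))

dp = delta_p(45, 15, 30, 30)          # 0.75 - 0.5  = 0.25
power = causal_power(45, 15, 30, 30)  # 0.25 / 0.5  = 0.5
```

The two measures dissociate whenever the base rate P(e+|c−) is nonzero, which is what the behavioral comparisons below exploit.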
Some behavioral data (Buehner & Cheng, 1997) [Figure: people's judgments vs. ΔP and Power PC predictions, for ΔP = 0, 0.25, 0.5, 0.75, 1] • Independent effects of both causal power and ΔP. • Neither theory explains the trend for ΔP = 0.
Bayesian causal inference
• Hypotheses:
  h1: B → E and C → E;  h0: B → E only
  (w0, w1: strength parameters for B, C; the background cause B is unobserved and always present, B = 1)
• Probabilistic model: "noisy-OR"
  P(E=1 | B, C):
    (B,C):  (0,0)   (0,1)   (1,0)   (1,1)
    h1:       0       w1      w0     w1 + w0 − w1·w0
    h0:       0       0       w0     w0
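The noisy-OR parameterization above can be written as one line: each present cause independently fails to produce the effect with probability (1 − w). A small sketch:

```python
def p_effect(c, w0, w1, background=1):
    """P(E=1 | B=background, C=c) under the noisy-OR of h1.
    With w1 = 0 this reduces to h0 (background cause only)."""
    return 1 - (1 - w0) ** background * (1 - w1) ** c

# Reproduces the table's B = 1 entries, e.g. with w0 = w1 = 0.5:
p_no_c = p_effect(0, 0.5, 0.5)    # w0 = 0.5
p_with_c = p_effect(1, 0.5, 0.5)  # w0 + w1 - w0*w1 = 0.75
```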
Inferring structure versus estimating strength • Hypotheses: h1: B → E and C → E; h0: B → E only • Both causal power and ΔP correspond to maximum likelihood estimates of the strength parameter w1, under different parameterizations for p(E|B,C): – linear: ΔP; noisy-OR: causal power • Causal support model: people are judging the probability that a causal link exists, rather than assuming it exists and estimating its strength.
Role of domain theory (c. f. PRMs, ILP, Knowledge-based model construction) Generates hypothesis space of causal graphical models: • Causally relevant attributes of objects: – Constrains random variables (nodes). • Causally relevant relations between attributes: – Constrains dependence structure of variables (arcs). • Causal mechanisms – how effects depend functionally on their causes: – Constrains local probability distribution for each variable conditioned on its direct causes (parents).
Role of domain theory • Injections may or may not cause gene expression, but gene expression does not cause injections. – No hypotheses with E → C • Other naturally occurring processes may also cause gene expression. – All hypotheses include an always-present background cause B → E • Causes are probabilistically sufficient and independent (Cheng): each cause independently produces the effect in some proportion of cases. – "Noisy-OR" causal mechanism
Bayesian causal inference: causal support
• Hypotheses: h1: B → E and C → E; h0: B → E only
• support = log [ P(D|h1) / P(D|h0) ], with noisy-OR likelihood; assume all priors uniform:
  P(D|h1) = ∫∫ P(D|w0, w1, h1) p(w0, w1|h1) dw0 dw1
  P(D|h0) = ∫ P(D|w0, h0) p(w0|h0) dw0
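The causal support computation can be sketched by integrating the marginal likelihoods on a coarse grid (a numerical stand-in for the integrals above; the grid resolution and the mouse-data counts are the only assumptions):

```python
import math

# Data: mouse example above (c+: 45 of 60 showed E; c-: 30 of 60).
n1_e, n1 = 45, 60   # injected
n0_e, n0 = 30, 60   # not injected

def log_lik(p_plus, p_minus):
    """Log-likelihood of the data given P(E|c+) and P(E|c-)."""
    return (n1_e * math.log(p_plus) + (n1 - n1_e) * math.log(1 - p_plus)
            + n0_e * math.log(p_minus) + (n0 - n0_e) * math.log(1 - p_minus))

grid = [i / 100 for i in range(1, 100)]   # uniform prior on a grid

def log_mean_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

# h1: noisy-OR, P(E|c+) = w0 + w1 - w0*w1, P(E|c-) = w0
log_p_h1 = log_mean_exp([log_lik(w0 + w1 - w0 * w1, w0)
                         for w0 in grid for w1 in grid])
# h0: background only, P(E|c+) = P(E|c-) = w0
log_p_h0 = log_mean_exp([log_lik(w0, w0) for w0 in grid])

support = log_p_h1 - log_p_h0   # positive: data favor a C -> E link
```

The extra parameter of h1 is automatically penalized by the marginalization, which is the Bayesian Occam's Razor at work in the next slides.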
Bayesian Occam's Razor [Figure: P(data | model) over all possible data sets (ordered by increasing ΔP), for h0 (B → E) and h1 (B, C → E) with low w1 vs. high w1]
Buehner & Cheng, 1997 [Figure: People vs. ΔP, Power PC, and Bayes model predictions, for ΔP = 0, 0.25, 0.5, 0.75, 1]
Sensitivity analysis • How much work does the domain theory do? – Alternative model: Bayes with arbitrary P(E|B,C), i.e. Bayes without the noisy-OR theory. • How much work does Bayes do? – Alternative model: χ² measure of independence.
[Figure: People vs. ΔP, Power PC (MLE w/ noisy-OR), Bayes w/ noisy-OR theory, Bayes without noisy-OR theory, and χ²]
Varying number of observations [Figure: People (n=8), Bayes (n=8), People (n=60), Bayes (n=60), for ΔP = 0, 0.25, 0.5, 0.75, 1]
Data for inhibitory causes [Figure: People vs. ΔP, Power PC (MLE w/ noisy-AND-NOT), and Bayes w/ noisy-AND-NOT, for ΔP = −0.25, −0.5, −0.75, −1]
Causal inference with rates [Figure: People vs. ΔR, Power PC (N=150), and Bayes w/ Poisson parameterization]
Causal induction: summary • People’s judgments closely reflect optimal Bayesian model selection, constrained by a minimal domain theory. • Beyond elemental causal induction: – More complex inferences, with causal networks, hidden variables, active learning. – Stronger inferences, with richer prior knowledge. – Discovery of causal domain theories.
The plan • Basic causal learning • Inferring number concepts • Reasoning with biological properties • Acquisition of domain theories – Intuitive biology: Taxonomic structure – Intuitive physics: Causal law
The number game • Program input: number between 1 and 100 • Program output: “yes” or “no”
The number game • Learning task: – Observe one or more positive (“yes”) examples. – Judge whether other numbers are “yes” or “no”.
The number game Examples of “yes” numbers 60 Generalization judgments (N = 20) Diffuse similarity
The number game Examples of “yes” numbers 60 60 80 10 30 Generalization judgments (N = 20) Diffuse similarity Rule: “multiples of 10”
The number game Examples of “yes” numbers 60 Generalization judgments (N = 20) Diffuse similarity 60 80 10 30 Rule: “multiples of 10” 60 52 57 55 Focused similarity: numbers near 50–60
The number game Examples of “yes” numbers 16 Generalization judgments (N = 20) Diffuse similarity 16 8 2 64 Rule: “powers of 2” 16 23 19 20 Focused similarity: numbers near 20
The number game 60 Diffuse similarity 60 80 10 30 Rule: “multiples of 10” 60 52 57 55 Focused similarity: numbers near 50–60 Main phenomena to explain: – Generalization can appear either similarity-based (graded) or rule-based (all-or-none). – Learning from just a few positive examples.
Rule/similarity hybrid models • Category learning – Nosofsky, Palmeri et al. : RULEX – Erickson & Kruschke: ATRIUM
Divisions into “rule” and “similarity” subsystems • Category learning – Nosofsky, Palmeri et al. : RULEX – Erickson & Kruschke: ATRIUM • Language processing – Pinker, Marcus et al. : Past tense morphology • Reasoning – Sloman – Rips – Nisbett, Smith et al.
Rule/similarity hybrid models • Why two modules? • Why do these modules work the way that they do, and interact as they do? • How do people infer a rule or similarity metric from just a few positive examples?
Bayesian model • H: Hypothesis space of possible concepts. – h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”) – h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”) – h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”) – h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”) – . . . • Representational interpretations for H: – Candidate rules – Features for similarity – “Consequential subsets” (Shepard, 1987)
Three hypothesis subspaces for number concepts • Mathematical properties (24 hypotheses): – Odd, even, square, cube, prime numbers – Multiples of small integers – Powers of small integers • Raw magnitude (5050 hypotheses): – All intervals of integers with endpoints between 1 and 100. • Approximate magnitude (10 hypotheses): – Decades (1–10, 10–20, 20–30, …)
Hypothesis spaces and theories • Why a hypothesis space is like a domain theory: – Represents one particular way of classifying entities in a domain. – Not just an arbitrary collection of hypotheses, but a principled system. • What’s missing? – Explicit representation of the principles. – [Causality. ] • Hypothesis space is generated by theory.
Bayesian model • H: Hypothesis space of possible concepts. – Mathematical properties: even, odd, square, prime, . . . – Approximate magnitude: {1–10}, {10–20}, {20–30}, . . . – Raw magnitude: all intervals between 1 and 100. • X = {x1, . . . , xn}: n examples of a concept C. • Evaluate hypotheses given data: – p(h) [“prior”]: domain knowledge, pre-existing biases – p(X|h) [“likelihood”]: statistical information in examples. – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h) • Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases: p(X|h) = (1/|h|)^n if x1, . . . , xn ∈ h (and 0 otherwise). • Follows from the assumption of randomly sampled examples. • Captures the intuition of a representative sample.
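A minimal sketch of the size principle, using the two hypotheses from the illustration that follows (even numbers vs. multiples of 10):

```python
def likelihood(examples, h):
    """Size-principle likelihood: (1/|h|)^n if all n examples fall in h."""
    return (1 / len(h)) ** len(examples) if all(x in h for x in examples) else 0.0

evens = set(range(2, 101, 2))      # |h| = 50
mult10 = set(range(10, 101, 10))   # |h| = 10

# Each additional multiple-of-10 example multiplies the advantage of the
# smaller hypothesis by a factor of 50/10 = 5:
r1 = likelihood([60], mult10) / likelihood([60], evens)                  # 5
r4 = likelihood([60, 80, 10, 30], mult10) / likelihood([60, 80, 10, 30], evens)  # 5**4 = 625
```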
Illustrating the size principle [Figure: h1 = even numbers between 1 and 100 (50 elements); h2 = multiples of 10 (10 elements)]
Illustrating the size principle [Figure: h1 = even numbers; h2 = multiples of 10] Data slightly more of a coincidence under h1.
Illustrating the size principle [Figure: h1 = even numbers; h2 = multiples of 10] Data much more of a coincidence under h1.
Bayesian Occam's Razor [Figure: p(D = d | M) over all possible data sets d, for a simple model M1 and a more complex model M2] For any model M, Σ_d p(D = d | M) = 1.
Prior: p(h) • Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. • Prevents overfitting by highly specific but unnatural hypotheses, e. g. “multiples of 10 except 50 and 70”.
A domain-general approach to priors? • Start with a base set of regularities R and combination operators C. • Hypothesis space = closure of R under C. – C = {and, or}: H = unions and intersections of regularities in R (e.g., “multiples of 10 between 30 and 70”). – C = {and-not}: H = regularities in R with exceptions (e.g., “multiples of 10 except 50 and 70”). • Two qualitatively similar priors: – Description length: number of combinations in C needed to generate hypothesis from R. – Bayesian Occam's Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior.
Prior: p(h) • Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. • Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”. • p(h) encodes relative plausibility of alternative theories: – Mathematical properties: p(h) ~ 1 – Approximate magnitude: p(h) ~ 1/10 – Raw magnitude: p(h) ~ 1/50 (on average) • Also degrees of plausibility within a theory, e.g., for magnitude intervals of size s: [Figure: p(s) as a function of interval size s]
Posterior: p(h|X) ∝ p(X|h) p(h) • X = {60, 80, 10, 30} • Why prefer “multiples of 10” over “even numbers”? p(X|h). • Why prefer “multiples of 10” over “multiples of 10 except 50 and 20”? p(h). • Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
Bayesian Occam’s Razor Probabilities provide a common currency for balancing model complexity with fit to the data.
Generalizing to new objects Given p(h|X), how do we compute the probability that C applies to some new stimulus y?
Generalizing to new objects Hypothesis averaging: compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X): p(y ∈ C | X) = Σ_{h: y ∈ h} p(h|X)
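The full pipeline — size-principle likelihood, prior, posterior, and hypothesis averaging — fits in a short sketch. The toy hypothesis space below is an assumption (a small subset of the deck's three subspaces), with a uniform prior for simplicity:

```python
H = {
    'even':        set(range(2, 101, 2)),
    'mult of 10':  set(range(10, 101, 10)),
    'mult of 5':   set(range(5, 101, 5)),
    'powers of 2': {2, 4, 8, 16, 32, 64},
    '50 to 60':    set(range(50, 61)),
}

def posterior(X):
    """p(h|X) with uniform prior and size-principle likelihood."""
    scores = {name: (1 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
              for name, h in H.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def p_in_concept(y, X):
    """Hypothesis averaging: p(y in C | X) = sum of p(h|X) over h containing y."""
    post = posterior(X)
    return sum(p for name, p in post.items() if y in H[name])

X = [60, 80, 10, 30]
# "multiples of 10" dominates the posterior, so generalization looks
# nearly rule-based: high for 20, low for 35, zero for 67.
```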
Examples: 16
Examples: 16 8 2 64
Examples: 16 23 19 20
[Figure: human generalization vs. Bayesian model predictions, for example sets 60; 60 80 10 30; 60 52 57 55; 16; 16 8 2 64; 16 23 19 20]
Summary of the Bayesian model • How do the statistics of the examples interact with prior knowledge to guide generalization? • Why does generalization appear rule-based or similarity-based? broad p(h|X): similarity gradient narrow p(h|X): all-or-none rule
Summary of the Bayesian model • How do the statistics of the examples interact with prior knowledge to guide generalization? • Why does generalization appear rule-based or similarity-based? Many h of similar size: broad p(h|X) One h much smaller: narrow p(h|X)
Alternative models • Neural networks – Supervised learning inapplicable. – Simple unsupervised learning not sufficient: [Figure: network with inputs 60 80 10 30 and feature units for even, multiple of 10, multiple of 3, power of 2]
Alternative models • Neural networks • Similarity to exemplars – Average similarity: [Figure: data vs. model for example sets 60; 60 80 10 30; 60 52 57 55 (r = 0.80)]
Alternative models • Neural networks • Similarity to exemplars – Average similarity – Max similarity: [Figure: data vs. model for example sets 60; 60 80 10 30; 60 52 57 55 (r = 0.64)]
Alternative models • Neural networks • Similarity to exemplars – Average similarity – Max similarity – Flexible similarity? Bayes.
Explaining similarity • Hypothesis: A principal function of similarity is generalization. • A theory of generalization can thus explain (some aspects of) similarity: – The similarity of X to Y is to a significant degree determined by the probability of generalizing from X to Y, or from Y to X, or both. • Opposite of traditional approach: similarity explains generalization.
Explaining similarity • Spatial models – Why exponential decay with distance? • Common feature models – Why additive measure? – What determines feature weights, and why? (specificity, relational preference, diagnosticity, context-sensitivity) • Contrast model – Why (and when) are both common & distinctive features relevant? – When is similarity asymmetric?
Alternative models • Neural networks • Similarity to exemplars – Average similarity – Max similarity – Flexible similarity? Bayes. • Toolbox of simple heuristics – 60: “general” similarity – 60 80 10 30: most specific rule (“subset principle”). – 60 52 57 55: similarity in magnitude Why these heuristics? When to use which heuristic? Bayes.
Numbers: Summary • Theory-based statistical inference explains inductive generalization from one or a few examples. • Explains the dynamics of both rule-like and similarity-like generalization through the interaction of: – Structure of domain-specific knowledge. – Domain-general principles of rational inference.
Limitations of the number game • No sense in which theory is the “right” or “wrong” description of world structure. – Number game is conventional, not natural. • Purely logical structure of theory does much of the work, with statistics just selecting among hypotheses. – Theory itself is not probabilistic. • Theory just amounts to a systematization for a set of hypotheses. – No causal mechanisms.
The plan • Basic causal learning • Inferring number concepts • Reasoning with biological properties • Acquisition of domain theories – Intuitive biology: Taxonomic structure – Intuitive physics: Causal law
Which argument is stronger? Horses have biotinic acid in their blood Cows have biotinic acid in their blood Rhinos have biotinic acid in their blood All mammals have biotinic acid in their blood Squirrels have biotinic acid in their blood Dolphins have biotinic acid in their blood Rhinos have biotinic acid in their blood All mammals have biotinic acid in their blood
Osherson, Smith, Wilkie, Lopez, Shafir (1990): • 20 subjects rated the strength of 45 arguments: X1 have property P. X2 have property P. X3 have property P. All mammals have property P. • 40 different subjects rated the similarity of all pairs of 10 mammals.
Similarity-based models (Osherson et al.) strength(“all mammals” | X) [Figure: mammals (x) and examples plotted in similarity space]
Similarity-based models (Osherson et al.) • Sum-Similarity: strength(“all mammals” | X) ∝ Σ_{i ∈ mammals} Σ_{x ∈ X} sim(i, x)
Similarity-based models (Osherson et al.) • Max-Similarity: strength(“all mammals” | X) ∝ Σ_{i ∈ mammals} max_{x ∈ X} sim(i, x)
Sum-Sim versus Max-Sim • The two models appear functionally similar: – Both increase monotonically as new examples are observed. • Reasons to prefer sum-sim: – Standard form of exemplar models of categorization, memory, and object recognition. – Analogous to kernel density estimation techniques in statistical pattern recognition. • Reasons to prefer max-sim: – Fit to generalization judgments.
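The behavioral difference between the two models can be seen on a toy similarity matrix (the values below are illustrative, not Osherson et al.'s data): max-sim rewards diverse example sets that cover the category, while sum-sim rewards redundant, typical examples.

```python
# Assumed toy pairwise similarities among four mammals.
sim = {
    ('horse', 'cow'): 0.9, ('horse', 'dolphin'): 0.2, ('horse', 'squirrel'): 0.4,
    ('cow', 'dolphin'): 0.2, ('cow', 'squirrel'): 0.4, ('dolphin', 'squirrel'): 0.2,
}

def s(a, b):
    return 1.0 if a == b else sim.get((a, b), sim.get((b, a)))

MAMMALS = ['horse', 'cow', 'dolphin', 'squirrel']

def strength(examples, agg):
    """strength('all mammals' | examples): aggregate each mammal's
    similarity to the example set with agg (sum -> sum-sim, max -> max-sim)."""
    return sum(agg(s(i, x) for x in examples) for i in MAMMALS)

sum_close = strength(['horse', 'cow'], sum)       # similar (redundant) pair
sum_spread = strength(['horse', 'dolphin'], sum)  # diverse pair
max_close = strength(['horse', 'cow'], max)
max_spread = strength(['horse', 'dolphin'], max)
# Max-sim rates the diverse pair higher (the diversity effect people show);
# sum-sim rates the redundant pair higher.
```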
Data vs. models [Scatter plot: data vs. model; each point represents one argument:] X1 have property P. X2 have property P. X3 have property P. All mammals have property P.
Three data sets [Figure: Max-sim and Sum-sim fits for three data sets — conclusion kind “all mammals” (3 examples), “horses” (2 examples), and “horses” (1, 2, or 3 examples)]
Explaining similarity • Why does max-sim fit so well? • Why does sum-sim fit so poorly? • Are there cases where max-sim will fail?
Marr’s Three Levels of Analysis • Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out? ” • Representation and algorithm: Max-Sim, Sum-Sim • Implementation: Neurobiology
Scientific theory of biology • Species generated by an evolutionary branching process. – A tree-structured taxonomy of species. • Features generated by stochastic mutation process and passed on to descendants. – Similarity a function of distance in tree.
An intuitive theory of biology • Species generated by an evolutionary branching process. – A tree-structured taxonomy of species. • Features generated by stochastic mutation process and passed on to descendants. – Similarity a function of distance in tree. Sources: Cognitive anthropology: Atran, Medin Cognitive development: Keil, Carey
A model of theory-based induction 1. Reconstruct the intuitive taxonomy from similarity judgments: [Figure: taxonomic tree over horse, cow, elephant, rhino, mouse, squirrel, dolphin, seal, gorilla, chimp]
A model of theory-based induction 2. Hypothesis space H: each taxonomic cluster is a possible hypothesis for the extension of a novel feature. [Figure: taxonomic tree with clusters labeled h1, h3, h6, h17, . . ., and h0 = “all mammals”]
p(h): uniform [Figure: taxonomic tree over the ten mammals; h0 = “all mammals”]
[Figure: Bayes (taxonomic), Max-sim, and Sum-sim fits for three data sets — conclusion kind “all mammals” (3 examples), “horses” (2 examples), and “horses” (1, 2, or 3 examples)]
[Figure: Bayes (taxonomic), Max-sim, and Sum-sim predictions for two arguments with conclusion kind “all mammals” (3 examples):] Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P. — versus — Seals have property P. Dolphins have property P. Squirrels have property P. All mammals have property P.
[Figure: taxonomic tree; h0 = “all mammals”] Cows have property P. Dolphins have property P. Squirrels have property P. All mammals have property P.
[Figure: taxonomic tree; h0 = “all mammals”] Seals have property P. Dolphins have property P. Squirrels have property P. All mammals have property P.
Scientific theory of biology • Species generated by an evolutionary branching process. – A tree-structured taxonomy of species. • Features generated by stochastic mutation process and passed on to descendants. – Similarity a function of distance in tree. – Novel features can appear anywhere in tree, but some distributions are more likely than others.
A model of theory-based induction 2. Generate hypotheses for a novel feature F via a (Poisson arrival) mutation process over branches b: [Figure: mutations arising on branches of the taxonomic tree]
A model of theory-based induction 2. Generate hypotheses for a novel feature F via a (Poisson arrival) mutation process over branches b: [Figure: taxonomic tree] Induced prior p(h): • Every subset of objects is a possible hypothesis. • Prior p(h) depends on the number and length of branches needed to span h.
Bayesian Occam’s Razor Probabilities provide a common currency for balancing model complexity with fit to the data.
Induced prior p(h) [Figure: taxonomic tree] • Monophyletic properties more likely than polyphyletic properties: p({horse, cow, elephant, rhino}) > p({chimp, gorilla, elephant, rhino})
Induced prior p(h) [Figure: taxonomic tree] • Novel properties more likely to occur on long branches than on short branches: p({dolphin, seal}) > p({horse, cow})
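Both prior effects fall out of the mutation process in the single-mutation limit: a feature arising on a branch (with probability proportional to branch length) is shared by exactly that branch's descendant leaves. The toy tree topology and branch lengths below are assumptions for illustration:

```python
branches = [
    # (descendant leaves of the branch, branch length)
    (frozenset({'horse'}), 1.0),
    (frozenset({'cow'}), 1.0),
    (frozenset({'horse', 'cow'}), 0.5),      # short internal branch
    (frozenset({'dolphin'}), 1.0),
    (frozenset({'seal'}), 1.0),
    (frozenset({'dolphin', 'seal'}), 3.0),   # long internal branch
    (frozenset({'horse', 'cow', 'dolphin', 'seal'}), 1.0),
]

# Single-mutation prior: p(h) proportional to the length of the branch
# whose descendant set is exactly h.  Polyphyletic sets (e.g. {horse,
# dolphin}) need multiple mutations and so get prior ~0 in this limit.
total = sum(length for _, length in branches)
prior = {leaves: length / total for leaves, length in branches}

p_dolphin_seal = prior[frozenset({'dolphin', 'seal'})]  # long branch
p_horse_cow = prior[frozenset({'horse', 'cow'})]        # short branch
```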
[Figure: taxonomic tree; h0 = “all mammals”] p(h): “evolutionary” process (mutation + inheritance)
[Figure: Bayes (taxonomic), Max-sim, and Sum-sim fits for the three data sets — conclusion kind “all mammals” (3 examples), “horses” (2 examples), and “horses” (1, 2, or 3 examples)]
[Figure: Bayes (taxonomy + mutation), Max-sim, and Sum-sim fits for the three data sets — conclusion kind “all mammals” (3 examples), “horses” (2 examples), and “horses” (1, 2, or 3 examples)]
Model variants • Version 1: Simple taxonomic hypothesis space instead of full hypothesis space with prior based on mutation process. • Version 2: Simple taxonomic hypothesis space with Hebbian learning instead of Bayesian inference. • Version 3: Taxonomy based on actual evolutionary tree rather than psychological similarity.
[Figure: model fits across the three data sets for Bayes (taxonomic), Hebb (taxonomic), and Bayes (actual evolutionary tree); correlations shown: r = 0.51, 0.41, 0.90, −0.41, 0.88, 0.45, 0.40, 0.61]
Mutation principle versus pure Occam’s Razor • Mutation principle provides a version of Occam’s Razor, by favoring hypotheses that span fewer disjoint clusters. • Could we use a more generic Bayesian Occam’s Razor, without the biological motivation of mutation?
A model of theory-based induction 2. Generate hypotheses for a novel feature F via a (Poisson arrival) mutation process over branches b: [Figure: taxonomic tree] Induced prior p(h): • Every subset of objects is a possible hypothesis. • Prior p(h) depends on the number and length of branches needed to span h.
Premise typicality effect (Rips, 1975; Osherson et al., 1990): Strong: Horses have property P. → All mammals have property P. Weak: Seals have property P. → All mammals have property P. [Figure: Bayes (taxonomy + Occam), Max-sim, and Sum-sim predictions; conclusion kind “all mammals”, 1 example]
Premise typicality effect (Rips, 1975; Osherson et al., 1990): Strong: Horses have property P. → All mammals have property P. Weak: Seals have property P. → All mammals have property P. [Figure: Bayes (taxonomy + mutation), Max-sim, and Sum-sim predictions; conclusion kind “all mammals”, 1 example]
Intuitive versus scientific theories of biology • Same structure for how species are related. – Tree-structured taxonomy. • Same probabilistic model for traits – Small probability of occurring along any branch at any time, plus inheritance. • Different features – Scientist: genes – People: coarse anatomy and behavior
[Figure: Bayes (taxonomy + mutation), Max-sim, and Sum-sim fits for the three data sets — conclusion kind “all mammals” (3 examples), “horses” (2 examples), and “horses” (1, 2, or 3 examples)]
Explaining similarity • Why does max-sim fit so well? • Why does sum-sim fit so poorly? • Are there cases where max-sim will fail?
Explaining similarity • Why does max-sim fit so well? – It is an efficient and accurate approximation to the Bayesian (evolution) model. [Figure: histogram of correlations (r) with Bayes on three-premise general arguments, over 100 simulated tree structures; mean r = 0.94]
Explaining similarity • Why does max-sim fit so well? – The approximation is domain specific. Cf. the number game: [Figure: data vs. model for example sets 60; 60 80 10 30; 60 52 57 55 (r = 0.64)]
Explaining similarity • Why does sum-sim fit so poorly? – It prefers sets of the most typical examples, which are not representative of the category as a whole. [Figure: histogram of correlations (r) with Bayes on three-premise general arguments, over 100 simulated tree structures; mean r = −0.26]
Explaining similarity • Are there cases where max-sim will fail? – An example from Medin et al. (in press): Brown bears have property P. Polar bears have property P. Grizzly bears have property P. Horses have property P. The Bayesian model makes the correct prediction, due to the size principle (the assumption that examples are sampled randomly from the concept).
A more systematic test of the Size Principle
Biology: Summary • Theory-based statistical inference explains taxonomic inductive reasoning in folk biology. • Reveals essential principles of domain theory. – Category structure: taxonomic tree. – Feature distribution: stochastic mutation process + inheritance. • Clarifies processing-level models. – Why max-sim over sum-sim? – When is max-sim a good heuristic approximation to full Bayesian inference?