Analyzing iterated learning – Tom Griffiths (Brown University), Mike Kalish (University of Louisiana)
Cultural transmission • Most knowledge is based on secondhand data • Some things can only be learned from others – cultural objects transmitted across generations • Studying the cognitive aspects of cultural transmission provides unique insights…
Iterated learning (Kirby, 2001) • Each learner sees data, forms a hypothesis, produces the data given to the next learner • cf. the playground game “telephone”
Objects of iterated learning • It’s not just about languages… • In the wild: – religious concepts – social norms – myths and legends – causal theories • In the lab: – functions and categories
Outline 1. Analyzing iterated learning 2. Iterated Bayesian learning 3. Examples 4. Iterated learning with humans 5. Conclusions and open questions
Discrete generations of single learners • PL(h|d): probability of inferring hypothesis h from data d • PP(d|h): probability of generating data d from hypothesis h
Markov chains • Transition matrix T = P(x(t+1) | x(t)) • Variables x(t+1) independent of history given x(t) • Converges to a stationary distribution under easily checked conditions for ergodicity
Stationary distributions • Stationary distribution: π(x) = Σx′ P(x | x′) π(x′) • In matrix form: π = Tπ • π is the first eigenvector of the matrix T (eigenvalue 1) • Second eigenvalue sets rate of convergence
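As a quick numerical illustration (a made-up 3-state chain, not one from the talk), the stationary distribution can be read off the leading eigenvector of T, and the second eigenvalue governs how fast the chain forgets its starting state:

```python
import numpy as np

# Hypothetical 3-state transition matrix; column j holds P(x_next | x_current = j),
# so each column sums to 1 and the stationary distribution satisfies pi = T pi.
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.3],
              [0.1, 0.2, 0.5]])

eigvals, eigvecs = np.linalg.eig(T)
order = np.argsort(-eigvals.real)        # largest eigenvalue (1) first
pi = eigvecs[:, order[0]].real
pi = pi / pi.sum()                       # normalize the leading eigenvector

print("stationary distribution:", pi)
print("second eigenvalue (convergence rate):", eigvals.real[order[1]])
```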
Analyzing iterated learning • The chain d0 → h1 → d1 → h2 → d2 → h3 → … alternates learning, PL(h|d), and production, PP(d|h) • It can be viewed as a Markov chain on hypotheses (h1, h2, h3, …), a Markov chain on data (d0, d1, d2, …), or a Markov chain on hypothesis-data pairs ((h1, d1), (h2, d2), (h3, d3), …)
A Markov chain on hypotheses • Transition probabilities sum out data: Q(hn+1|hn) = Σd PL(hn+1|d) PP(d|hn) • Stationary distribution and convergence rate from eigenvectors and eigenvalues of Q – can be computed numerically for matrices of reasonable size, and analytically in some cases
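A toy sketch of this construction, assuming small hypothetical matrices PL and PP for two hypotheses and two datasets; raising Q to a power shows its columns converging to the stationary distribution:

```python
import numpy as np

# Hypothetical two-hypothesis, two-dataset example.
# P_P[d, h] = P_P(d | h): probability of generating data d from hypothesis h
# P_L[h, d] = P_L(h | d): probability of inferring hypothesis h from data d
P_P = np.array([[0.7, 0.2],
                [0.3, 0.8]])
P_L = np.array([[0.9, 0.3],
                [0.1, 0.7]])

# One generation of iterated learning: Q[h_new, h_old] = sum_d P_L(h_new|d) P_P(d|h_old)
Q = P_L @ P_P

# Iterating the chain: the columns of Q^n converge to the stationary distribution
for n in (1, 5, 20):
    print(f"Q^{n} =\n{np.linalg.matrix_power(Q, n)}")
```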
Infinite populations in continuous time • “Language dynamical equation” (Nowak, Komarova, & Niyogi, 2001): dxj/dt = Σi xi fi(x) Qij − φ(x) xj • “Neutral model” (fj(x) constant) (Komarova & Nowak, 2003) • Stable equilibrium at first eigenvector of Q
Outline 1. Analyzing iterated learning 2. Iterated Bayesian learning 3. Examples 4. Iterated learning with humans 5. Conclusions and open questions
Bayesian inference • Rational procedure for updating beliefs • Foundation of many learning algorithms (e.g., MacKay, 2003) • Widely used for language learning (e.g., Charniak, 1993) (portrait of the Reverend Thomas Bayes)
Bayes’ theorem • P(h|d) = P(d|h) P(h) / Σh′ P(d|h′) P(h′) • h: hypothesis, d: data • P(h|d): posterior probability • P(d|h): likelihood • P(h): prior probability • denominator: sum over the space of hypotheses
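A toy numerical illustration of the update (the coin hypotheses, prior, and data here are invented for the example):

```python
# Two hypotheses about a coin: fair vs. biased towards heads
hypotheses = {"fair": 0.5, "biased": 0.9}     # P(heads | h)
prior = {"fair": 0.8, "biased": 0.2}          # P(h)

data = ["H", "H", "T", "H"]                   # observed flips

# Likelihood P(d | h) for the whole sequence
def likelihood(h, flips):
    p = hypotheses[h]
    out = 1.0
    for f in flips:
        out *= p if f == "H" else (1 - p)
    return out

# Posterior P(h | d) = P(d | h) P(h) / sum_h' P(d | h') P(h')
unnorm = {h: likelihood(h, data) * prior[h] for h in prior}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)
```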
Iterated Bayesian learning • Learners are Bayesian agents: PL(h|d) is the posterior obtained by applying Bayes’ theorem with a shared prior P(h)
Markov chains on h and d • The Markov chain on h has the prior P(h) as its stationary distribution • The Markov chain on d has the prior predictive distribution, P(d) = Σh PP(d|h) P(h), as its stationary distribution
Markov chain Monte Carlo • A strategy for sampling from complex probability distributions • Key idea: construct a Markov chain which converges to a particular distribution – e.g. the Metropolis algorithm – e.g. Gibbs sampling
Gibbs sampling • For variables x = x1, x2, …, xn • Draw xi(t+1) from P(xi | x-i), where x-i = x1(t+1), x2(t+1), …, xi-1(t+1), xi+1(t), …, xn(t) • Converges to P(x1, x2, …, xn) (Geman & Geman, 1984) • a.k.a. the heat bath algorithm in statistical physics
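A minimal sketch for a standard bivariate Gaussian target (a textbook example, not one from the talk), alternately redrawing each coordinate from its conditional distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8          # correlation of the target bivariate standard Gaussian
x1, x2 = 0.0, 0.0
samples = []

for t in range(5000):
    # Conditionals of a standard bivariate Gaussian:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples[500:])              # discard burn-in
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])
```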
Gibbs sampling (illustration from MacKay, 2003)
Iterated learning is a Gibbs sampler • Iterated Bayesian learning is a Gibbs sampler for the joint distribution p(h, d) = PP(d|h) p(h) • Implies: – (h, d) converges to this distribution – convergence rates are known (Liu, Wong, & Kong, 1995)
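A sketch of this equivalence with a made-up two-hypothesis model: alternating the production step PP(d|h) and the Bayesian learning step PL(h|d) is exactly a two-block Gibbs sampler, so the hypothesis chain should end up sampling from the prior:

```python
import numpy as np

rng = np.random.default_rng(1)
prior = np.array([0.7, 0.3])                 # P(h) over two hypotheses
lik = np.array([[0.9, 0.1],                  # lik[h, d] = P(d | h)
                [0.4, 0.6]])

h, counts = 0, np.zeros(2)
for t in range(20000):
    d = rng.choice(2, p=lik[h])              # production step: d ~ P_P(d | h)
    post = lik[:, d] * prior                 # learning step: h ~ P_L(h | d),
    h = rng.choice(2, p=post / post.sum())   #   the Bayesian posterior
    counts[h] += 1

print("empirical P(h):", counts / counts.sum())   # should approach the prior
print("prior:         ", prior)
```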
Outline 1. Analyzing iterated learning 2. Iterated Bayesian learning 3. Examples 4. Iterated learning with humans 5. Conclusions and open questions
An example: Gaussians • If we assume… – data, d, is a single real number, x – hypotheses, h, are means of a Gaussian, μ – prior, p(μ), is Gaussian(μ0, σ0²) • …then p(xn+1|xn) is Gaussian(μn, σx² + σn²), with μn and σn² the posterior mean and variance given xn • p(xn|x0) is Gaussian(μ0 + c^n x0, (σx² + σ0²)(1 − c^(2n))), i.e. geometric convergence to the prior
Gaussian example: μ0 = 0, σ0² = 1, x0 = 20 • Iterated learning results in rapid convergence to the prior
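A simulation sketch of this chain using the slide's parameters; the data noise variance σx² is not stated here, so it is assumed to be 1:

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, var0 = 0.0, 1.0        # prior on the mean: Gaussian(mu0, var0)
var_x = 1.0                 # assumed data noise variance (not given on the slide)
x = 20.0                    # initial data x_0

for n in range(1, 11):
    # Learner: posterior over mu given the single observation x
    post_mean = (var0 * x + var_x * mu0) / (var0 + var_x)
    post_var = var0 * var_x / (var0 + var_x)
    mu = rng.normal(post_mean, np.sqrt(post_var))
    # Producer: new datum generated from the inferred mean
    x = rng.normal(mu, np.sqrt(var_x))
    print(f"iteration {n}: x = {x:.2f}")
# x drifts from 20 back towards the prior mean 0 within a few iterations
```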
An example: linear regression • Assume – data, d, are pairs of real numbers (x, y) – hypotheses, h, are functions • An example: linear regression – hypotheses have slope θ and pass through the origin (y = θx) – p(θ) is Gaussian(θ0, σ0²)
Linear regression example: θ0 = 1, σ0² = 0.1, y0 = −1 (initial datum at x = 1)
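A parallel sketch for the regression chain, where each learner infers a slope from one (x, y) pair at x = 1 and produces the next pair (the observation noise variance is assumed to be 1, since it is not given on the slide):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, var0 = 1.0, 0.1     # prior on the slope: Gaussian(theta0, var0)
var_y = 1.0                 # assumed observation noise variance
y = -1.0                    # initial datum y_0, observed at x = 1

for n in range(1, 11):
    # Learner: posterior over the slope theta given the pair (x = 1, y)
    post_mean = (var0 * y + var_y * theta0) / (var0 + var_y)
    post_var = var0 * var_y / (var0 + var_y)
    theta = rng.normal(post_mean, np.sqrt(post_var))
    # Producer: next learner's datum at x = 1
    y = rng.normal(theta, np.sqrt(var_y))
    print(f"iteration {n}: theta = {theta:.2f}, y = {y:.2f}")
# the slope chain drifts from the initial data towards the prior mean theta0 = 1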
An example: compositionality • A language is a function from events x to utterances y • In a compositional language, parts of events (“agents”, “actions”) map to parts of utterances (“nouns”, “verbs”)
An example: compositionality • Hypotheses: languages (compositional or holistic), produced with error, with prior P(h) • Data: m event-utterance pairs
Analysis technique 1. Compute transition matrix on languages 2. Sample Markov chains 3. Compare language frequencies with prior (can also compute eigenvalues etc. )
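A sketch of these three steps for a small hypothetical set of languages (the prior and production probabilities below are invented; in the talk, the data would be m event-utterance pairs produced with error):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup: four languages, with a prior favouring the first.
# lang_lik[h, d] = P_P(d | h): probability that language h produces observable data d
prior = np.array([0.5, 0.2, 0.2, 0.1])
lang_lik = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.10, 0.70, 0.10, 0.10],
                     [0.10, 0.10, 0.70, 0.10],
                     [0.10, 0.10, 0.10, 0.70]])

# 1. Compute the transition matrix on languages: Q[h', h] = sum_d P_L(h'|d) P_P(d|h)
post = lang_lik * prior[:, None]              # unnormalized P_L(h | d), shape (h, d)
P_L = post / post.sum(axis=0, keepdims=True)  # Bayesian learner
Q = P_L @ lang_lik.T                          # note P_P(d | h) = lang_lik[h, d]

# 2. Sample a Markov chain of learners
h, counts = 0, np.zeros(len(prior))
for t in range(50000):
    h = rng.choice(len(prior), p=Q[:, h])
    counts[h] += 1

# 3. Compare language frequencies with the prior
print("chain frequencies:", counts / counts.sum())
print("prior:            ", prior)
```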
Convergence to priors • Effect of the prior: chains run with prior parameter 0.50 vs. 0.01 (error parameter 0.05, m = 3) • Figure: language frequencies for the chain and the prior, plotted against iteration; the chains converge to the prior
The information bottleneck • No effect of bottleneck: chains with (prior, error, m) = (0.50, 0.05, 1), (0.01, 0.05, 3), and (0.50, 0.05, 10) • Figure: language frequencies for the chain and the prior, plotted against iteration
The information bottleneck Bottleneck affects relative stability of languages favored by prior
Outline 1. Analyzing iterated learning 2. Iterated Bayesian learning 3. Examples 4. Iterated learning with humans 5. Conclusions and open questions
A method for discovering priors • Iterated learning converges to the prior… • …so learners’ priors can be evaluated by having people perform iterated learning
Iterated function learning • Data: sets of (x, y) pairs; hypotheses: functions • Each learner sees a set of (x, y) pairs • Makes predictions of y for new x values • Predictions are the data for the next learner
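A sketch of the transmission loop itself, with a simple simulated learner standing in for the human participant (the kernel-smoothing learner is just an illustrative stand-in, not the model used in the talk):

```python
import numpy as np

rng = np.random.default_rng(5)
train_x = np.linspace(0, 1, 20)               # x values shown to each learner

def simulated_learner(xs, ys, new_x, noise=0.05):
    """Stand-in for a human: noisy kernel-smoothed predictions of y for new x."""
    preds = []
    for x in new_x:
        w = np.exp(-((xs - x) ** 2) / 0.02)   # Gaussian kernel weights
        preds.append(np.sum(w * ys) / np.sum(w) + rng.normal(0, noise))
    return np.array(preds)

# Initial data: e.g. a decreasing linear function
ys = 1.0 - train_x

for generation in range(1, 9):
    # This generation's predictions become the next generation's training data
    ys = simulated_learner(train_x, ys, train_x)
    print(f"generation {generation}: first/last prediction = "
          f"{ys[0]:.2f}, {ys[-1]:.2f}")
```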
Function learning in the lab • Each trial: stimulus, response (slider), feedback • Examine iterated learning with different initial data
Figure: chains of responses for different initial data across iterations 1–9 (Kalish, 2004)
Outline 1. Analyzing iterated learning 2. Iterated Bayesian learning 3. Examples 4. Iterated learning with humans 5. Conclusions and open questions
Conclusions and open questions • Iterated Bayesian learning converges to prior – properties of languages are properties of learners – information bottleneck doesn’t affect equilibrium • What about other learning algorithms? • What determines rates of convergence? – amount and structure of input data • What happens with people?