Dynamic computational networks
John Goldsmith, University of Chicago
April 2005
Work done in collaboration with Gary Larson (Wheaton College) Other work by Bernard Laks and his students (Paris X)
Two models in neurocomputing:
1. In space: lateral inhibition. Work done jointly with Gary Larson. Discrete unit modeling.
2. In time: neural oscillation.
The structure of linguistic time
- Language is not merely a sequence of one unit after another, at any level.
- In phonology, we have known since classical times about syllables and feet, whatever they are. What are they?
One view of syllables
The Panini-Saussure view: language is uttered in waves of increasing and decreasing sonority. Syllables are units that begin with a sequence of rising sonority, and end with a sequence of falling sonority.
Pike-Hockett-Fudge-Selkirk
- The alternative to the wave view of the syllable was proposed by Pike (Pike and Pike 1947), who proposed to apply Bloomfield's syntactic model of immediate constituents to phonology. Bloomfield was not amused.
- Hockett 1955 (among others) took this as a central fact about phonology: that all apparent phonological sequences were really hierarchical structure.
Accent
Metrical theory (Liberman 1975) came in two flavors:
- Hierarchical theory (Liberman and Prince 1977)
- Metrical grid (Prince 1982)
The grid model emphasized the rise and fall of an unnamed quantity. Halle (and collaborator) attempted to integrate constituency and the grid.
Immediate constituents (ICs)
- So the granddaddy of the constituent theory of syllables and feet is the structuralist theory of ICs.
- ICs were reformulated by Harris and by Chomsky as Phrase-Structure Grammars.
- What's the central idea of PSGs? (And why should we care?)
PSGs
Basic message: the structure in language does not pass from one terminal element to another, but flows up and down a tree. The structural link between two adjacent elements is expressible directly iff they are sisters:
a ‘det’ can be followed by an N because there is a rule NP → det N; the generalization is through the mother category. This relationship is unchanged if there is a linearly intervening element:
PSGs are not designed to deal with relationships between adjacent terminal elements. That’s a hypothesis about the nature of syntax.
PSGs
- are designed to deal with structurally defined positions that can be recursively elaborated indefinitely.
- They are unnecessary for accounting for material that can be indefinitely expanded in a linear sense (i.e., flat structure).
PSGs
Not good at dealing with distinct functions assigned to the same distributional categories in different positions (i.e., marking pre-verbal NPs as subjects, post-verbal NPs as objects; distinguishing the functions of post-verbal NPs; etc.)
Note what GPSG did:
- Split up PS rules into mother-daughter relations (immediate constituency) and left-right relationships.
- And in phonology?
What kind of structure do phonological representations need?
- Proposal: they need to be able to identify local peaks and global peaks of two quantities: sonority and accent.
- We need to build a model in which that computation is accomplished, and no other.
Original motivation for this particular model
- Dell and Elmedlaoui's analysis of Tashlhiyt Berber.
- There, the generalization appears to be that segments compete with their neighbors with respect to sonority, so to speak.
- In most cases, a segment is a syllable nucleus if and only if its sonority is greater than that of both of its neighbors.
- We take that to be the central operation: search for elements at which a function takes on a peak value w.r.t. its neighbors (discrete versions of the 1st and 2nd derivative).
- To this, we add another hypothesis: that the value of the function may be influenced by its context.
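That central operation can be sketched in a few lines. This is a minimal illustration, not the talk's program: an element counts as a nucleus iff its value exceeds that of both neighbors, with word edges treated as zero-sonority neighbors (an assumption), and the sonority values are invented for the example.

```python
# Sketch of the peak-finding operation: an element is a nucleus iff its
# value is greater than both neighbors' (the discrete analogue of a
# 1st/2nd-derivative test). Word edges count as zero-sonority neighbors.
def peaks(values):
    nuclei = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else 0
        right = values[i + 1] if i < len(values) - 1 else 0
        if v > left and v > right:
            nuclei.append(i)
    return nuclei

# A toy sonority profile for a CVCVC string: vowels peak at positions 1, 3.
assert peaks([1, 8, 2, 8, 3]) == [1, 3]
```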
Its context?
- Sonority: the inherent sonority of a segment may be influenced by its environment. The segments to its left and right may increase or decrease its sonority. Derived sonority is a function of both the inherent sonority and the sonority of the neighbors.
- Accent:
Accent
- The accent on an element is a function of both its inherent accentability (weight of the syllable = sum of the sonorities of the syllable or the coda) and its context.
- Context? A stressed syllable destresses syllables on either side; an unstressed syllable stresses syllables on either side.
- All part of the same computational system.
Syllabification and accent are not part of a general, all-purpose phonological computational engine.
Dynamic computational nets
1. Brief demonstration of the program
2. Some background on (some aspects of) metrical theory
3. This network model as a minimal computational model of the solution we're looking for
4. Its computation of familiar cases
5. Interesting properties of this network: inversion and learnability
6. Link to neural circuitry
Let’s look at the program --
Dynamic computational nets
1. Brief demonstration of the program
2. Some background on (some aspects of) metrical theory
3. This network model as a minimal computational model of the solution we're looking for
4. Its computation of familiar cases
5. Interesting properties of this network: inversion and learnability
6. Link to neural circuitry
(Figure: initial activation vs. final activation.)
P(i) is the positional activation assigned to a syllable by virtue of being the (first or last) syllable of the word. That activation does not “go away” computationally.
Beta = -0.9: rightward spread of activation
Alpha = -0.9: leftward spread of activation
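The spreading dynamics can be sketched as a relaxation loop: each unit's derived activation is its inherent activation plus beta times its left neighbor's activation plus alpha times its right neighbor's. This is my reading of the model, not the original program; note that the strong -0.9 weights above need not settle under this naive update for longer words, so the sketch defaults to the smaller alpha = -0.4, beta = -0.2 values that appear later in the talk.

```python
# Hedged sketch of the dynamic net's relaxation to equilibrium:
# s_i <- u_i + beta * s_(i-1) + alpha * s_(i+1), edges see 0.
def relax(inherent, alpha=-0.4, beta=-0.2, tol=1e-9, max_iters=1000):
    s = list(inherent)
    n = len(s)
    for _ in range(max_iters):
        new = []
        for i in range(n):
            left = s[i - 1] if i > 0 else 0.0
            right = s[i + 1] if i < n - 1 else 0.0
            new.append(inherent[i] + beta * left + alpha * right)
        if max(abs(a - b) for a, b in zip(new, s)) < tol:
            return new
        s = new
    return s

# e.g. word-initial positional activation on a 3-syllable word:
s = relax([1.0, 0.0, 0.0])
```

At equilibrium the derived values satisfy the update equation exactly, which is what the demonstrations above rely on.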
Dynamic computational nets
1. Brief demonstration of the program
2. Some background on (some aspects of) metrical theory
3. This network model as a minimal computational model of the solution we're looking for
4. Its computation of familiar cases
5. Interesting properties of this network: inversion and learnability
6. Link to neural circuitry
Examples (Hayes)
Pintupi (Hansen and Hansen 1969, 1978; Australia): “syllable trochees”: stress odd-numbered syllables (counting rightward); extrametrical ultima: Ss Ss.Ss.Sss
Weri (Boxwell and Boxwell 1966, Hayes 1980, HV 1987)
- Stress the ultima, plus
- stress all odd-numbered syllables, counting from the end of the word.
Warao (Osborn 1966, HV 1987)
Stress penult syllable; plus all even-numbered syllables, counting from the end of the word. (Mark last syllable as extrametrical, and run.)
Maranungku (Tryon 1970)
- Stress first syllable, and
- all odd-numbered syllables from the beginning of the word.
Garawa (Furby 1974) (or Indonesian, …)
- Stress on initial syllable;
- stress on penult;
- stress on all even-numbered syllables, counting leftward from the end; but
- “initial dactyl effect”: no stress on the second syllable permitted.
Two other potential parameters to explore:
- Penult activation
- Bias = uniform activation for all units
Why penult? Why not? In most cases,
- negative Penult activation = positive Final activation,
- positive Penult activation = negative Final activation.
But…
Two reasons to consider Penult… (in addition to the fact that it's easily learnable):
1. One source for antepenult patterns
2. Explanation of two patterns of cyclic stress assignment
Two kinds of cyclic assignment
Indonesian type: stress the penult: … s s. S s
add a suffix: … s s s S ] s
add a suffix: … s s ] S ] s
[ S s S ] òtogògráfi
versus
[ [ S s s s S s ] ] kòn tin u a sí ña (A. Cohn)
I = 0.65, Pen = -1.0, alpha = -0.4, beta = -0.2
[ [ s s s ] ]: 0.31, -0.72, -0.85
Greek
a. [ s1 s2 s3 ]
   Inherent: 0, 0, -1
   Derived: -0.2, 0.5, -1
b. [ s1 s2 s3 s4 ]
   Inherent: 0, 0, 0, -1
   Derived: 0.06, -0.2, 0.5, -1
c. [[ s1 s2 s3 ] s4 ]
   Inherent: 0, 0, 0, -1
   Derived: -0.12, 0.25, -0.5, -1
Other type (Greek, …): … s s S s
add a suffix: … s s S s ] s (stress doesn't shift)
add another suffix: … s s s s ] S ] s
Q-sensitive: Latin stress rule Stress penult if it is heavy; otherwise, stress the antepenult.
Q-sensitive systems: an example: ultima or penult
An analysis: a = -0.2, b = 0.8; heavy syllables get D = 2.0.
If bias > 0: Yapese. If bias < 0: Rotuman.
Dynamic computational nets
1. Brief demonstration of the program
2. Some background on (some aspects of) metrical theory
3. This network model as a minimal computational model of the solution we're looking for
4. Its computation of familiar cases
5. Interesting properties of this network: inversion and learnability
6. Link to neural circuitry
Network M
Input (underlying representation) is a vector U.
Dynamics: (1) S(t+1) = U + M·S(t)
Output is S*, the equilibrium state of (1), which by definition satisfies: S* = U + M·S*
Hence: S* = (I - M)^(-1) U. Quite a surprise!
Please note… This is not a system where you input a vector U, and watch in the limit
Inversion, again -- note the near-eigenvector property.
Dynamics: S(t+1) = U + M·S(t); iterating S0 = U, S1 = U + M·S0, S2 = U + M·S1, …, S* = lim Sn.
The output S* is the equilibrium state, which by definition satisfies S* = U + M·S*.
Hence: U = (I - M)·S* (I is the identity matrix).
Fast recoverability of underlying form This means that if you take the output S* of a network of this sort, and make the output undergo the network effect once — that’s M S* — [M’s a matrix, S a vector] and subtract that from S* — that’s (I-M) S* — you reconstruct what that network’s input state was. (This would be a highly desirable property if we had designed it in!)
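The recoverability claim can be checked numerically. This is a hedged sketch with a 3-unit net and illustrative weights alpha = -0.4, beta = -0.2 (my choices, not values fixed by the talk): relax to the equilibrium S*, apply the network effect M once, subtract, and the input U comes back.

```python
# (I - M)·S* = S* - M·S* recovers the underlying form U in one step,
# since the equilibrium satisfies S* = U + M·S*.
ALPHA, BETA = -0.4, -0.2

def m_times(s):
    """M·s for the tridiagonal coupling matrix M (edges see 0)."""
    n = len(s)
    out = []
    for i in range(n):
        left = s[i - 1] if i > 0 else 0.0
        right = s[i + 1] if i < n - 1 else 0.0
        out.append(BETA * left + ALPHA * right)
    return out

def equilibrium(u, iters=200):
    """Iterate S <- U + M·S to (numerical) equilibrium."""
    s = list(u)
    for _ in range(iters):
        s = [ui + mi for ui, mi in zip(u, m_times(s))]
    return s

u = [1.0, 0.0, -0.5]
s_star = equilibrium(u)
recovered = [si - mi for si, mi in zip(s_star, m_times(s_star))]  # (I-M)·S*
assert all(abs(r - ui) < 1e-6 for r, ui in zip(recovered, u))
```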
Learnability
Work done with Gary Larson, reported in his 1993 dissertation (University of Chicago). In a word: very learnable. Why? Because of the continuous character of the mapping from parameter space to prediction space: a small change in parameters leads to a small change in predictions…
a small change in parameters leads to a small change in predictions… in a sense, the opposite of a theory constructed to have a rich deductive structure. Because of continuity, ….
We used a variant of simulated annealing. Simulated annealing is usually used to find an optimal value in state space (with a given set of parameters learned during the learning phase). Its attractiveness is its ability to escape from local optima that aren’t globally optimal. We used a variant during the learning phase….
We establish an initial “temperature” of 100 degrees. Think of temperature as a measure of uncertainty: 100 = no knowledge; 0 = no need to change one's mind.
Training: present forms with correct stress patterns. If the stress patterns are what the model predicts, decrease the temperature by 1 degree. If not, change the parameters (the parameter vector) in a random direction, for a distance proportional to the current temperature. Stop when it's freezing.
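The training loop above can be sketched directly. This is a schematic stand-in for the Larson-style learner, not the original program: `predicts` and the parameter-vector representation are placeholders, and the step size is an invented constant.

```python
import random

# Hedged sketch of the annealing-style learner: right answers cool the
# system; wrong answers kick the parameters in a random direction,
# with a jump proportional to the current temperature.
def anneal(parameters, training_data, predicts, t0=100.0, step=0.01):
    temperature = t0
    params = list(parameters)
    while temperature > 0:
        word, correct_stress = random.choice(training_data)
        if predicts(params, word) == correct_stress:
            temperature -= 1.0   # prediction correct: cool by one degree
        else:
            # prediction wrong: random jump scaled by temperature
            params = [p + random.uniform(-1, 1) * step * temperature
                      for p in params]
    return params
```

With a model that always predicts correctly, the temperature simply cools from 100 to 0 and the parameters are returned unchanged.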
We seek regions, not settings
A setting P of parameters is a point, of measure zero. That setting maps onto a “phenomenological” characterization (i.e., linguist-speak) HP. How big is the region that maps to HP? Picture…
The accessibility of a metrical system is the measure of the region it maps from (its inverse image): in simple terms, its area (as a proportion of total area). That region is the inverse image of HP: the set of values that are realized as that kind of stress system. HP: penult stress, with alternating stress from right to left.
Dynamic computational nets
1. Brief demonstration of the program
2. Some background on (some aspects of) metrical theory
3. This network model as a minimal computational model of the solution we're looking for
4. Its computation of familiar cases
5. Interesting properties of this network: inversion and learnability
6. Link to neural circuitry
The challenge of language:
- For the hearer: he must perceive the (intended) objects in the sensory input despite the extremely impoverished evidence of them in the signal -- a task like (but perhaps harder than) visual pattern identification.
- For the speaker: she must produce and utter a signal which contains enough information to permit the hearer to perceive it as a sequence of linguistic objects.
Never was there a better use of the phrase, “I have a story to tell…. ” Let’s try it anyway.
Let’s interrogate the visual system to see if any of its basic components offer means to do the computation we’re taking a look at today.
Visual context: edge detection Mach bands
Edge detection through lateral inhibition
In a 1- or 2-dimensional array of neurons, neurons:
a. excite very close neighbors;
b. inhibit neighbors in a wider neighborhood;
c. do not affect cells further away.
(Figure: excitation at the center, surrounded by a region of inhibition.)
DOGs
Center-surround structures are often modeled as a “difference of Gaussians”: take two Gaussian distributions of different variances (widths), subtract one from the other, and you get a sombrero.
Difference of gaussians = sombrero? See white board! The Web failed me.
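In lieu of the whiteboard, the sombrero is easy to compute: a narrow excitatory Gaussian minus a wide inhibitory one, positive at the center and negative in the surround. The variance values here are illustrative choices, not values from the talk.

```python
import math

# 1-D difference-of-Gaussians ("sombrero") kernel.
def gaussian(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def dog(x, sigma_center=1.0, sigma_surround=3.0):
    return gaussian(x, sigma_center) - gaussian(x, sigma_surround)

# Excitatory at the center, inhibitory in the surround:
assert dog(0.0) > 0
assert dog(2.5) < 0
```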
Old news in phonology? Stress on the initial syllable or penult has a demarcative function in phonology, marking the word-edge for the hearer.
A brief run-through on lateral inhibition…
- Hartline and Ratliff 1957, in the horseshoe crab (Limulus).
- Lateral inhibition leads to contrast enhancement and edge detection, under a wide range of parameter settings.
- Early models used non-recurrent connections; later models preferred recurrent patterns of activation…
Recurrent lateral inhibition
Recurrent models include loops of activation which retain traces of the input over longer micro-periods (Wilson and Cowan 1972; Grossberg 1973; Amari). Recurrent inhibitory loops also lead to circuits that perform (temporal) frequency detection.
Recurrent lateral inhibition
- …also leads to winner-take-all computations, when the weight of the lateral inhibition is great.
- Most importantly for us, as noted by Wilson and Cowan 1973, lateral inhibition circuits respond characteristically to spatial frequencies.
Evolution of thinking about the visual cell's receptive field: from simple characteristic field (Hubel & Wiesel) to spatial frequency detector (J. P. Jones and L. A. Palmer 1987. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58(6): 1233-1258).
Initially lateral inhibition gives rise to edge detection, and classic Mach band phenomena. Observe how a recurrent (feedback) competitive network of lateral inhibition gives rise to a pattern of spatial waves.
DOGs suggest…
Each syllable unit (in this model) has a receptive field. Looking at just its 1st neighbor to left and right is the crudest simplification of such a model. The next step would be to add a second term for the “over-neighbor”…
Over-neighbor term: if DOG < 0, then we have the sombrero pattern.
DOG pattern
Gives rise to interesting new patterns: for example, over a wide range of negative values for the DOG ratio, with a > 0 and F = -1, we get a robust antepenult stress pattern (demo). (This appears to be an edge-transient effect, like a large part of the effects seen in this model.)
To wrap up: things not spoken of
1. This has been a theory of stress lapse, not a theory of stress clash. Save that for another day. (A 2nd model that the brain might use.)
2. Tone languages
3. Constituents, mainly feet
Addendum: May 19, 1999
- Let's bring time and dynamical systems into the picture: by which I mean, computational time = real time.
- That excludes right-to-left systems, but leaves open very many complex systems.
Quantity-sensitive L->R alternation
As in Yup'ik: in a sequence of light (CV or CVC) syllables, stress even-numbered syllables:
da dá
0  1
but you cannot skip a heavy (CVV) syllable:
da dá dáa da dá
0  1  1   0  1
Note that you reset the timing. Some systems reset the timing starting with the heavy syllable itself; others reset with the next syllable.
2 oscillators
- One for stress (= Foot), one for syllables.
- The syllable oscillator is driven by the phonological substance (the consonants and vowels).
- We need a system with the following properties:
- If we plot the frequencies of Foot and Syllable against each other, we want to find that 1:1 and 1:2 are attractor states;
- when they are in a 1:1 relationship, every syllable is stressed; in 1:2, every second syllable is stressed.
- This sounds very familiar, but…
(Figure: Foot frequency vs. syllable frequency, with attractor lines f = s (1:1) and f = 1/2 s (1:2).)
Entrainment with m:n ratios is common enough; but what is different about this system is that time is of the essence! We don't have 20+ cycles to hop from one attractor to the other: we have to do that in much less than one cycle.
(Figure: Foot frequency vs. syllable frequency, with attractor lines f = s (1:1) and f = 1/2 s (1:2).)
Simulation --
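The talk's simulation isn't reproduced here, but a standard toy model of frequency locking, the sine-circle map (my substitution, not the talk's oscillator system), shows the kind of m:n attractor at issue: with a bare frequency ratio near 1/2 and moderate coupling, the rotation number locks onto exactly 1:2, the regime where every second syllable is stressed.

```python
import math

# Sine-circle map: theta(n+1) = theta(n) + Omega - (K/2pi) sin(2pi theta(n)).
# The rotation number (average advance per step) locks to rational values
# inside the Arnold tongues; Omega = 0.5, K = 1 sits in the 1:2 tongue.
def rotation_number(omega, k, n_iter=10000, transient=1000):
    theta = 0.1
    total = 0.0
    for i in range(n_iter + transient):
        delta = omega - (k / (2 * math.pi)) * math.sin(2 * math.pi * theta)
        theta += delta
        if i >= transient:
            total += delta
    return total / n_iter

rho = rotation_number(0.5, 1.0)
assert abs(rho - 0.5) < 1e-3   # locked onto the 1:2 attractor
```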
Hayes's generalizations
- Culminativity: each word or phrase has a single strongest syllable bearing the main stress. TRUE IF THAT SYLLABLE IS USED TO MAP A TONE MELODY (ETL).
- Rhythmic distribution: syllables bearing equal levels of stress tend to occur spaced at equal distances.
- Stress hierarchies (Liberman/Prince): several levels of stress.
- Lack of assimilation as a natural process.
Metrical grid (columns of grid marks, x). The height of the grid marks rhythmic prominence. Each level may represent a possible rhythmic analysis (“layer”).
Goldsmith-Larson (dynamic computational) model Model syllables as units with an activation level; the strength of the activation level roughly corresponds to the height of the column on the metrical grid.
Some generalizations about prosodic systems of the world
Very crude distinction between tone and non-tone languages. It's easier to say what a tone language is; it is not clear that non-tone languages form a homogeneous group. They have accent/stress…
Light editing of Hayes' typology of accentual systems…
“Free versus fixed stress”: when is it predictable which syllable is accented? When it is predictable, what kinds of computation are necessary to make the prediction?
Word-based generalizations (i.e., not sensitive to word-internal morphological structure): rhythmic versus non-rhythmic systems. In rhythmic systems, there are upper limits on how many consecutive unstressed syllables there may be. The usual limit is no more than 1. And the usual limit is no more than 1 consecutive stressed syllable.
Hayes's typologies
- Free vs. fixed stress (predictable or not by rule)
- Rhythmic versus morphological stress
  - Morphological: boundary-induced versus use of morphological information to resolve competition
- Bounded versus unbounded stress (length of span of unstressed syllables)
Is the height of a metrical column a value of a variable?
If so, this would explain the Continuous Column Constraint: a grid is ill-formed if a grid mark on level n+1 is not matched by a grid mark on level n in the same column (an effect that shows up in several environments: in stress shift, in immobility of strong beats, in main stress placement, in destressing).
Is constituency in metrical structure strongly motivated?
#(x .) … #  á a á a …
… (x .)#  … á a á a #
We could think of assigning trochaic feet, either from left to right or from right to left.
Syllable weight
Syllables are divided into heavy and light, primarily by the sum of the sonority of the post-nuclear material in the syllable.
Latin stress rule:
- No stress on final syllables;
- stress on antepenult if penult is light; else
- stress on (heavy) penult.
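The Latin rule as stated is small enough to write out. A sketch, with syllables given as weight symbols 'H' (heavy) or 'L' (light); this representation and the function name are my assumptions, and the function returns the 0-based index of the stressed syllable.

```python
# Latin stress rule: no stress on the final syllable; heavy penult
# attracts stress; otherwise stress falls on the antepenult.
def latin_stress(weights):
    n = len(weights)
    if n == 1:
        return 0          # monosyllable: stress it
    if n == 2:
        return 0          # disyllable: stress the penult (never the ultima)
    return n - 2 if weights[-2] == 'H' else n - 3

assert latin_stress(['L', 'H', 'L']) == 1   # heavy penult: penult stress
assert latin_stress(['L', 'L', 'L']) == 0   # light penult: antepenult
```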
Hayes' parametric theory
- Choice of foot type:
  - i. size (maximum: unary/binary/ternary/unbounded)
  - ii. Q-sensitivity parameter
  - iii. trochaic vs. iambic (S/W, W/S)
- Direction of parsing: rightward, leftward
- Iterative foot assignment?
- Location (create new metrical layer)
- Extrametricality…
Extrametricality
- Units (segments, syllables, feet, …) can be marked as extrametrical…
- if they are peripheral (at the correct periphery)…
- and enough remains after they become invisible.
Dynamic computational networks (Goldsmith, Larson)
Goal: to find (in some sense) the minimum computation that gets maximally close to the data at hand. What structure is required in the empirically robust cases?
Network M
Input (underlying representation) is a vector U.
Dynamics: (1) S(t+1) = U + M·S(t)
Output is S*, the equilibrium state of (1), which by definition satisfies: S* = U + M·S*
Hence: S* = (I - M)^(-1) U. Quite a surprise!
Learnability n Larson (1992) showed that these phonological systems were highly learnable from surface data.
A spatial sine wave…
A spatial square wave…
Initially lateral inhibition gives rise to edge detection, and classic Mach band phenomena. Observe how a recurrent (feedback) competitive network of lateral inhibition gives rise to a pattern of spatial waves.
1. Introduction and overview: the cognitive task of language, generating and perceiving linguistic objects
2. Linguistics: metrical stress theory; Goldsmith-Larson model of metrical accentuation
3. Neuro-computation: lateral inhibition in computational neurobiology
4. Neuro-computation: neural oscillators
5. Linguistics: quantity-sensitivity as phase-locking attractor states of a
Present two models today:
- Dynamic computational networks. Work done jointly with Gary Larson. Discrete unit modeling.
- Coupled harmonic oscillators, to deal with certain types of quantity-sensitive stress assignment (left-to-right only); utilizes attractor states of the dynamical system. Continuous modeling.
Moras and syllables (sequence of CVCVCV…) Moras, Syllables, and Stress