Latent Semantic Mapping: Dimensionality Reduction via Globally Optimal Continuous Parameter Modeling
Jerome R. Bellegarda
Outline
• Introduction
• LSM
• Applications
• Conclusions
Introduction
• LSA in IR:
  – Words of queries and documents
  – Recall and precision
• Assumption: there is some underlying latent semantic structure in the data
  – The latent structure is conveyed by correlation patterns
  – Documents: bag-of-words model
• LSA improves separability among different topics
Introduction
• Successes of LSA:
  – Word clustering
  – Document clustering
  – Language modeling
  – Automated call routing
  – Semantic inference for spoken interface control
• These solutions all leverage LSA's ability to expose global relationships in context and meaning
Introduction
• Three unique factors underlie LSA:
  – The mapping of discrete entities
  – The dimensionality reduction
  – The intrinsically global outlook
• The terminology is changed to latent semantic mapping (LSM) to convey increased reliance on these general properties
Latent Semantic Mapping
• LSM defines a mapping between two discrete sets and a continuous vector space:
  – M: an inventory of M individual units, such as words
  – N: a collection of N meaningful compositions of units, such as documents
  – L: a continuous vector space
  – r_i: unit in M
  – c_j: composition in N
Feature Extraction
• Construct a matrix W of co-occurrences between units and compositions
• The (i, j) cell of W:

  w_ij = (1 − ε_i) c_ij / n_j

  where c_ij is the number of times r_i occurs in c_j, n_j is the total number of units in c_j, and ε_i is the normalized entropy of r_i
Feature Extraction
• The normalized entropy of r_i:

  ε_i = −(1 / log N) Σ_j (c_ij / t_i) log(c_ij / t_i),  with t_i = Σ_j c_ij

• A value of ε_i close to 0 means that the unit is present only in a few specific compositions
• The global weight 1 − ε_i is therefore a measure of the indexing power of the unit r_i
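The entropy-weighted construction above can be sketched in a few lines. The toy counts below are purely illustrative, not from the paper: one unit spread across compositions (low indexing power) and one concentrated in a single composition (high indexing power).

```python
import math

# Toy counts c_ij: rows are units r_i, columns are compositions c_j.
counts = [
    [2, 0, 1],   # unit spread over several compositions -> low indexing power
    [0, 3, 0],   # unit concentrated in one composition  -> high indexing power
]

def lsm_weights(counts):
    """Build W with w_ij = (1 - eps_i) * c_ij / n_j."""
    n_units, n_comps = len(counts), len(counts[0])
    # n_j: total number of units in composition c_j
    n_j = [sum(counts[i][j] for i in range(n_units)) for j in range(n_comps)]
    W = []
    for row in counts:
        t_i = sum(row)  # total occurrences of unit r_i
        # Normalized entropy eps_i over the compositions containing r_i
        eps = -sum((c / t_i) * math.log(c / t_i) for c in row if c) / math.log(n_comps)
        W.append([(1 - eps) * c / n_j[j] for j, c in enumerate(row)])
    return W

W = lsm_weights(counts)
```

The concentrated unit ends up with global weight 1 (entropy 0), while the spread-out unit is down-weighted.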
Singular Value Decomposition
• The M × N unit-composition matrix W defines two vector representations for the units and the compositions:
  – r_i: a row vector of dimension N
  – c_j: a column vector of dimension M
• Impractical:
  – M, N can be extremely large
  – The vectors r_i, c_j are typically sparse
  – The two spaces are distinct from each other
Singular Value Decomposition
• Employ the (truncated) SVD:

  W ≈ Ŵ = U S V^T

• U: M × R left singular matrix with row vectors u_i
• S: R × R diagonal matrix of singular values
• V: N × R right singular matrix with row vectors v_j
• U, V are column-orthonormal: U^T U = V^T V = I_R
• R ≪ min(M, N)
Singular Value Decomposition
• Ŵ captures the major structural associations in W and ignores higher-order effects
• Closeness of vectors in L supports:
  – Unit-unit comparison
  – Composition-composition comparison
  – Unit-composition comparison
Closeness Measure
• W W^T: co-occurrences between units
• W^T W: co-occurrences between compositions
• r_i, r_j: units which have a similar pattern of occurrence across the compositions
• c_i, c_j: compositions which have a similar pattern of occurrence across the units
Closeness Measure
• Unit-unit comparisons: characterized by W W^T = U S² U^T
• Cosine measure:

  K(r_i, r_j) = cos(u_i S, u_j S) = (u_i S² u_j^T) / (‖u_i S‖ ‖u_j S‖)

• Distance (via arccos): [0, π]
Closeness Measure
• Composition-composition comparisons: characterized by W^T W = V S² V^T
• Cosine measure:

  K(c_i, c_j) = cos(v_i S, v_j S) = (v_i S² v_j^T) / (‖v_i S‖ ‖v_j S‖)

• Distance (via arccos): [0, π]
Closeness Measure
• Unit-composition comparisons: characterized by the cells of Ŵ = U S V^T
• Cosine measure:

  K(r_i, c_j) = cos(u_i S^(1/2), v_j S^(1/2)) = (u_i S v_j^T) / (‖u_i S^(1/2)‖ ‖v_j S^(1/2)‖)

• Distance (via arccos): [0, π]
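All three measures follow one pattern: scale the rows of U and/or V by S (or S^(1/2)) and take a cosine. A minimal NumPy sketch, with U and V stood in by random column-orthonormal matrices rather than a real decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, R = 5, 4, 2
U, _ = np.linalg.qr(rng.random((M, R)))   # stand-in left singular vectors
V, _ = np.linalg.qr(rng.random((N, R)))   # stand-in right singular vectors
S = np.diag([3.0, 1.5])                   # stand-in singular values

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

k_rr = cos(U[0] @ S, U[1] @ S)            # unit-unit: rows of U S
k_cc = cos(V[0] @ S, V[1] @ S)            # composition-composition: rows of V S
Sh = np.sqrt(S)
k_rc = cos(U[0] @ Sh, V[0] @ Sh)          # unit-composition: rows of U S^(1/2), V S^(1/2)

# Each cosine maps to a distance in [0, pi] via arccos
assert all(-1.0 - 1e-9 <= k <= 1.0 + 1e-9 for k in (k_rr, k_cc, k_rc))
```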
LSM Framework Extension
• Observe a new composition c̃_p, p > N; the tilde symbol reflects the fact that the composition was not part of the original N
• c̃_p, a column vector of dimension M, can be thought of as an additional column of the matrix W
• U and S are assumed not to change
LSM Framework Extension : pseudo-composition vector • If the addition of causes the major structural associations in W to shift in some substantial manner, the singular vectors will become inadequate. 20
LSM Framework Extension
• In that case it would be necessary to re-compute the SVD to find a proper representation for the new composition
Salient Characteristics of LSM
• A single vector embedding for both units and compositions in the same continuous vector space L
• A relatively low dimensionality, which makes operations such as clustering meaningful and practical
• An underlying structure reflecting globally meaningful relationships, with natural similarity metrics to measure the distance between units, between compositions, or between units and compositions in L
Applications
• Semantic classification
• Multi-span language modeling
• Junk e-mail filtering
• Pronunciation modeling
• TTS unit selection
Semantic Classification
• Semantic classification refers to determining which one of several predefined topics a given document is most closely aligned with
• The centroid of each cluster can be viewed as the semantic representation of that topic in LSM space
  – Semantic anchor
• A newly observed word sequence is classified by computing the distance between its document vector and each semantic anchor, and picking the minimum
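The nearest-anchor decision rule can be sketched as below; the topic names and 2-D anchor coordinates are hypothetical stand-ins for centroids that would come from a trained LSM space.

```python
import numpy as np

# Hypothetical semantic anchors (cluster centroids) in a 2-D LSM space
anchors = {
    "sports":  np.array([0.9, 0.1]),
    "finance": np.array([0.2, 0.8]),
}

def classify(v):
    """Return the topic whose anchor has the highest cosine with vector v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(anchors, key=lambda topic: cos(anchors[topic], v))

label = classify(np.array([0.8, 0.3]))   # closest to the "sports" anchor
```

Maximizing the cosine is equivalent to minimizing the angular distance arccos in [0, π].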
Semantic Classification
• Domain knowledge is automatically encapsulated in the LSM space in a data-driven fashion
• For desktop interface control:
  – Semantic inference
Multi-Span Language Modeling
• In a standard n-gram, the history is the string of the n − 1 preceding words
• In LSM language modeling, the history is the current document up to the current word
• Pseudo-document d̃_(q−1):
  – Continually updated as q increases
Multi-Span Language Modeling
• An integrated n-gram + LSM formulation for the overall language model probability Pr(w_q | H_(q−1)):
  – Different syntactic constructs can be used to carry the same meaning (content words)
Multi-Span Language Modeling
• Assume that the probability of the document history given the current word is not affected by the immediate context preceding it
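Under that assumption, the integrated probability takes the form below, following the standard formulation in the LSA language-modeling literature (notation as on the preceding slides, with d̃_(q−1) the pseudo-document history and V the vocabulary):

```latex
\Pr(w_q \mid H_{q-1}) =
  \frac{\Pr(w_q \mid w_{q-1} \dots w_{q-n+1})\,
        \Pr(\tilde{d}_{q-1} \mid w_q)}
       {\sum_{w_i \in \mathcal{V}}
        \Pr(w_i \mid w_{q-1} \dots w_{q-n+1})\,
        \Pr(\tilde{d}_{q-1} \mid w_i)}
```

The n-gram factor carries the local syntactic constraints, while the LSM factor carries the global semantic constraints.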
Junk E-mail Filtering
• It can be viewed as a degenerate case of semantic classification with two categories:
  – Legitimate
  – Junk
• M: an inventory of words and symbols
• N: a binary collection of e-mail messages
• Two semantic anchors
Pronunciation Modeling
• Also called grapheme-to-phoneme conversion (GPC)
• Orthographic anchors
  – One for each in-vocabulary word
• Orthographic neighborhood
  – The in-vocabulary words with high closeness to an out-of-vocabulary word
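The neighborhood idea can be illustrated with a much-simplified stand-in: raw letter n-gram counts in place of a trained orthographic LSM space, and a tiny hypothetical lexicon. The closest in-vocabulary words would then supply pronunciation evidence for the unseen word.

```python
import numpy as np

def ngrams(word, n=3):
    """Letter n-grams with word-boundary markers."""
    w = f"#{word}#"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def vectorize(word, gram_inventory):
    g = ngrams(word)
    return np.array([g.count(x) for x in gram_inventory], dtype=float)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

lexicon = ["nation", "station", "notion"]   # hypothetical in-vocabulary words
oov = "stations"                            # out-of-vocabulary word
grams = sorted({g for w in lexicon + [oov] for g in ngrams(w)})

v_oov = vectorize(oov, grams)
# Orthographic neighborhood: in-vocabulary words ranked by closeness to the OOV word
neighbors = sorted(lexicon, key=lambda w: -cos(vectorize(w, grams), v_oov))
```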
Conclusions
• Descriptive power
  – Forgoing local constraints is not acceptable in some situations
• Domain sensitivity
  – Depends on the quality of the training data
  – Polysemy
• Updating the LSM space
  – Re-computing the SVD on the fly is not practical
• The success of LSM rests on its three salient characteristics