Deep Learning for Math Knowledge Processing
Abdou Youssef and Bruce Miller

Background and Drivers (1) -- Deep Learning Capabilities --
• Deep Learning's (DL) unprecedented breakthroughs in NLP
  ◦ Modeling, representation, and semantics
  ◦ Machine translation
  ◦ Speech recognition (and many other applications)
• Technical advances of DL (capitalizing on large datasets)
  ◦ Learning to represent better, capturing relational semantics
  ◦ Learning to do sequence-to-sequence mapping
  ◦ Learning to forget, to remember, and to focus attention

Background and Drivers (2) -- Math Needs --
• Math needs similar capabilities:
  ◦ Representation and semantics
  ◦ Translation (with high accuracy)
    ▪ Presentation to Computation (P2C): LaTeX/pMML → cMML/programs; informal → formal
    ▪ Visual to Digital (V2D): PDF scan → MML/LaTeX; handwritten → MML/LaTeX
  ◦ Software
    ▪ End-to-end apps
    ▪ Basic building blocks for people to synthesize new apps

Goals of this Project -- Short Term --
• Apply Deep Learning to:
  ◦ Math-entity representation learning (Math2Vec)
  ◦ Math semantics extraction and disambiguation
  ◦ Create algorithms and public-domain software
    ▪ For the above two tasks
    ▪ Trained models
  ◦ Create publicly available datasets (and APIs)
    ▪ Labeled data
    ▪ Math2Vec representations

Goals of this Project -- Long Term --
• Applications
  ◦ Semantic enrichment/annotation of math expressions
  ◦ P2C conversion of math expressions
  ◦ Math QA capabilities per manuscript/collection
  ◦ Enhanced math search, UI, authoring aid
  ◦ Etc.

Related Work
• Math OCR and formula recognition
• Math conversion
  ◦ TeX/LaTeX → XML/HTML/pMML
  ◦ pMML → cMML
• Math search
• Math semantification/annotation
• Machine learning: doc classification, topic modeling
• Math computational linguistics
• General NLP/computational linguistics
• Some use of DL already
Can we do these same tasks better with Deep Learning?

Our Related Ongoing Projects
• LaTeXML
  ◦ Parses TeX/LaTeX and converts it to pMML and XML
  ◦ Wherever possible, tags math tokens with roles/meanings
• Part-of-Math (POM) Tagger
  ◦ Tokenizes TeX/LaTeX with some parsing
  ◦ Assigns many tags to each math token
    ▪ Some definite and some tentative (from common uses)
  ◦ In later stages, disambiguates between tentative tags
• This DL project aims to provide/disambiguate semantics

Token type / Explanations and Examples / Part of Math
• Numbers
  ◦ Part of Math: numeric quantity; index; reference
• Letters, Alphabetic Strings
  ◦ Explanations and examples: Roman, Greek, Hebrew
  ◦ Part of Math: function; variable; argument; parameter; index; identifier
• Operators, Operations
  ◦ Unary operations and operators: −, +, ±, ∓, !, ¬, ~, diff. (d, ∂, ∇), integration (e.g., ∫, ∬, ∮, ⨖, ⨑), transforms, lim, inf, …
  ◦ Binary operations: −, +, /, ∧, ∨, ⋂, ⋃, ⊕, ⊗, ⊙, ⋈ …
  ◦ Multi-ary operations: Σ, Π, ⋃, ⋂, ⨄, ⋁, ⋀, ⊕, ⊗, ⊙ …
  ◦ Part of Math: operator; operations of various arities
• Relations
  ◦ Equalities/defs, approx, sim, equiv, congruence: =, ≜, ≡, ≃, ≈, ≐, ≑, ≒, …
  ◦ Inequalities: <, >, ⋖, ⋗, ≼, ⊀, ≾ …
  ◦ Set-theoretic relations: ⊂, ⊆, ⊉, ⊈, ⊉, ⋐, ⊒, ∈ …
  ◦ Logic relations: =>, …
  ◦ Turnstile relations: ⊦, ⊨, ⊩, ⊫, ⊪, ⊬, ⊭, ⊮, ⊯, …
  ◦ Triangle-shaped inequalities: ⊲, ⊳, ⊴, ⊵, ⋪, ⋫, ⋬ …
  ◦ Geometry/linear-algebra relations: ∥, ∣, ⊥, ≡, ≃, ≈
  ◦ Negated binary relations: ≠, ≢, ≄, ≴, ∦, ∤, ⊭, ⊈
  ◦ Miscellaneous: "divides" (∣), prop (∝), etc.
  ◦ Part of Math: relation (of the various kinds listed under Explanations and Examples)

Token type / Explanations and Examples / Part of Math
• Fence symbols
  ◦ Delimiters (grouping symbols): ( ) [ ] { } | || etc.
  ◦ Constructors: for creating/denoting sets, vectors, intervals
  ◦ Distributed multi-glyph (DMG) operators: |.|, ||.||, inner product "<.,.>", ket "|.>", etc.
  ◦ Part of Math: left-delimiter; right-delimiter; constructor; distributed multi-glyph operator
• Logic tokens
  ◦ Quantifiers: ∀, ∃, ∄, ∃!, ◊, etc.
  ◦ Proof tokens: ∴ (therefore), ∵ (because), ⋇ (contradiction)
  ◦ Part of Math: quantifier; proof token
• Punctuations
  ◦ ",", ";", ":", "|", " ", "/". They can be simple punctuation, separators between elements/args, implied conjunctions and conditionals, or glyphs in DMG operators
  ◦ Part of Math: the designations listed under Explanations and Examples
• Math accents
  ◦ Diacritics: overlines, underlines, hats, checks (i.e., upside-down hats), tildes, single/multiple dots, rings, acute/grave/breve accents, arrows/harpoons (e.g., for vectors), one or more primes (e.g., as postfix unary operators for differentiation)
  ◦ Grouping accents: horizontal { } [ ] ( ) etc., used under and over sub-expressions
  ◦ Extensible accents: adding symbols above/below the accents for further semantification of accents
  ◦ Part of Math: accent (of various types)

Token type / Explanations and Examples / Part of Math
• Literals, Constants
  ◦ Standard sets (e.g., ℕ, ℤ, ℚ, ℝ); infinities (e.g., ∞, ⧜, ⧝, ⧞); empty set (∅); various "standard" functions (e.g., sin, cos, sinh, log, exp, etc.); and math constants such as π (3.1415…), γ (Euler's constant), φ (the Golden ratio), etc.
  ◦ Part of Math: the designations listed under Explanations and Examples
• Arrows
  ◦ They come in various orientations, directions, valences (single/double/triple), head and tail shapes, line type and shape. Used in logic, geometry, function mapping, category theory, etc.
  ◦ Part of Math: arrow of various types (later scans refine this tag)
• Various shapes
  ◦ Harpoons: ⇁, ↽, ↼, ↾, ↿, etc. in geometry, vector analysis
  ◦ Smiles and frowns: ⌣, ⌢, ≍, etc. in topology and geometry
  ◦ Spoons: ⊸ as "multimap", ⟜, ⊷ as "image of", and ⊶ as "original of", etc. Used in function mapping
  ◦ Pitchforks (e.g., ⋔) used in manifolds
  ◦ Angles: ∠, ∡, ∢, etc. in geometry
  ◦ Part of Math: the designations listed under Explanations and Examples
• Ellipses
  ◦ Triple dots (e.g., …, ⋮, ⋰, ⋱) to designate missing terms in finite or infinite sequences, vectors, matrices, etc.
  ◦ Part of Math: ellipsis with its orientation
• "Other"
  ◦ ⊤ (top), ⊥ (bottom), ℏ (hslash), Ⅎ, ⅄, ⅁, etc.
  ◦ Part of Math: symbol

Features (of Tokens and Phrases)
Feature Name / Explanations and Values
• Category/Role: The grammatical role or part-of-math: operation, operator, relation, function, variable, parameter, constant, quantifier, separator, punctuation, abbreviation/acronym, delimiter, left-delimiter, right-delimiter, constructor, accent, etc.
• Subcategory: Further specializes the category: subscript, superscript, numerator, denominator, lower-limit (of an integral), constraint/condition, definition, etc. For an accent, indicates the accent position.
• Meaning: Examples: scalar addition, the cosine function, etc.
• Signature: Data types
• Font: The font characteristics: typeface, font-style, and font-weight
• Notational Status: Specifies whether the notation is Generic, Standard (i.e., meant as commonly understood), or Defined (in the manuscript).
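One way to picture these features is as a small record attached to each token. The sketch below is only an illustration of the feature list above, not the project's actual data model: the class names `TokenFeatures` and `Font`, the default values, and the example signature string are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# The three notational statuses named on the slide.
NOTATIONAL_STATUSES = {"Generic", "Standard", "Defined"}


@dataclass
class Font:
    """Font characteristics: typeface, font-style, font-weight (hypothetical defaults)."""
    typeface: str = "roman"
    style: str = "italic"
    weight: str = "normal"


@dataclass
class TokenFeatures:
    """Hypothetical record holding the features of one math token or phrase."""
    category: str                       # e.g. "function", "operator", "relation"
    subcategory: Optional[str] = None   # e.g. "superscript", "lower-limit"
    meaning: Optional[str] = None       # e.g. "the cosine function"
    signature: Optional[str] = None     # data types, in any agreed notation
    font: Optional[Font] = None
    notational_status: str = "Standard"

    def __post_init__(self):
        # Reject statuses other than the three listed on the slide.
        if self.notational_status not in NOTATIONAL_STATUSES:
            raise ValueError(f"unknown notational status: {self.notational_status}")


# Example: the token "cos" (the signature string is illustrative only).
cos_token = TokenFeatures(
    category="function",
    meaning="the cosine function",
    signature="real -> real",
    font=Font(style="upright"),
)
```

Keeping the tentative features optional mirrors the idea that some tags are only filled in or refined by later stages.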

Math Ambiguities
Ambiguity / Explanations
• Superscript: a power, an index, the order of diff, a postfix unary operator, etc.
• Juxtaposition: multiplication, function application, or concatenation
• Accent: an applied operator, or a morphological part of the name. y′: Derivative of y? Complement of y? A distinct variable?
• Part of math: Different roles mean distinct parse trees and different semantics. Ex: | and || can be punctuation, operators, relations, or delimiters.
• Scope: Typically occurring when delimiters are omitted. E.g., 'sin 2πx + 5': 'sin(2πx) + 5' more probable than 'sin(2πx + 5)'
• Data type: Necessary to completely resolve semantics; conversely, can help disambiguate other ambiguities, e.g., Superscript

Deep Learning Models -- Relevant to Math --
• Embedding (Feature Learning)
  ◦ Converts each math term (or expression) into a numerical feature vector
• Feedforward classifiers
  ◦ Good for document classification, and disambiguation between alternative tags
• Recurrent Neural Networks (RNNs)
  ◦ Input: a variable-length, ordered sequence (sentence, expression, equation)
  ◦ Output: a class, or another sequence (translation or annotation)
• Advanced RNNs
  ◦ Bidirectional LSTM: learn what to forget, and what to remember
  ◦ RNNs with Attention: learn what to focus on at any given time

Embedding
• Converts each math entity into a feature vector
• I.e., embeds each entity as a point in a vector space
• Similar/related entities map to algebraically similar/related vectors
  ◦ V(king) − V(queen) ≈ V(man) − V(woman)
  ◦ V(France) − V(Paris) ≈ V(Britain) − V(London)
  ◦ V(cos) − V(arccos) ≈ V(exp) − V(log) (??)
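The offset relations above can be checked numerically by comparing difference vectors with cosine similarity. The following is a minimal sketch with hand-picked 3-d toy vectors; the values in `V` are hypothetical and chosen only so that the offsets line up, whereas real embeddings would come from a trained model.

```python
import math


def sub(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, for illustration only.
V = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.2, 0.9, 0.3],
    "woman": [0.2, 0.2, 0.3],
    "cos":   [0.5, 0.5, 0.9],
}

offset_royal = sub(V["king"], V["queen"])
offset_person = sub(V["man"], V["woman"])

# Related pairs: the two offsets point in (nearly) the same direction.
similarity = cosine(offset_royal, offset_person)

# Unrelated pair: the offset king - cos points elsewhere.
unrelated = cosine(offset_royal, sub(V["king"], V["cos"]))
```

Whether math pairs such as (cos, arccos) and (exp, log) behave this way is exactly the open question marked "(??)" on the slide.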

Schematic of an Embedder (1) -- Input: a single word/symbol --
• Input W: a word or symbol
• One-hot vector of W (0 0 0 ... 0 1 0 0 ... 0): its length is the size of the vocabulary, with "1" in the position of W
• Embedder: Word2Vec, GloVe, Doc2Vec, etc.
  ◦ Trained on a dataset of text/math
  ◦ The set is large/small, generic/specialized
• Output: numerical feature vector of W (n1 n2 n3 ... nk), 100-d or 300-d
  ◦ A more meaningful representation of W, used henceforth as input to other models
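The pipeline above (one-hot vector in, dense vector out) amounts to selecting one row of an embedding matrix. A minimal sketch follows, assuming a five-token vocabulary and a hand-written 5×3 matrix `E`; both are hypothetical, and in practice the matrix is what Word2Vec, GloVe, and similar embedders learn from data.

```python
# Hypothetical toy vocabulary of math tokens.
vocab = ["sin", "cos", "x", "+", "="]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Vector of vocabulary size with 1 in the position of `word`."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


# Hypothetical embedding matrix (|vocab| x k, here k = 3); normally learned.
E = [
    [0.1, 0.2, 0.3],   # sin
    [0.1, 0.3, 0.2],   # cos
    [0.9, 0.0, 0.1],   # x
    [0.0, 0.8, 0.5],   # +
    [0.0, 0.7, 0.6],   # =
]


def embed(word):
    """Multiply the one-hot row vector by E; this picks out the row for `word`."""
    x = one_hot(word)
    k = len(E[0])
    return [sum(x[i] * E[i][j] for i in range(len(vocab))) for j in range(k)]


cos_vector = embed("cos")
```

Because the one-hot vector has a single nonzero entry, real implementations skip the multiplication and index the row directly.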

Schematic of an Embedder (2) -- Input: tagged word/n-gram --
• Input W: a tagged word, a tagged symbol, or an n-gram
• Input vector (0 . 0 1 0 . 0 ...): word/symbol part followed by tags part
• Embedder
• Output: numerical feature vector of the input (n1 n2 n3 ... nk)
• For an n-gram, the output can be:
  ◦ an organically new vector
  ◦ a concatenation/mean of the individual gram vectors
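The two composed options for n-grams (concatenation and mean) can be sketched as follows. The function names and the 2-d gram vectors are hypothetical; the "organically new vector" option would instead require training an embedder on n-grams directly and is not shown.

```python
def concat_embedding(vectors):
    """Concatenate the gram vectors; the result has n * k components."""
    return [x for vec in vectors for x in vec]


def mean_embedding(vectors):
    """Average the gram vectors component-wise; the result keeps k components."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical 2-d embeddings of the three tokens of a 3-gram.
grams = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

concatenated = concat_embedding(grams)
averaged = mean_embedding(grams)
```

Concatenation preserves token order at the cost of a vector whose length grows with n; the mean keeps a fixed length but discards order.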

Embedders (Math2Vec) in this Project
• Datasets:
  ◦ Collections of math papers from arXiv, grouped into areas of math
  ◦ The DLMF pages (a single class: special functions)
• Different embeddings
  ◦ Embedding of individual symbols and words, tagged and untagged
  ◦ Embedding of n-grams, expressions, equations
• Software
  ◦ Text+math tokenizer: add-on to LaTeXML and the POM tagger
  ◦ A variety of fine-tuned, synthesized math embedders

Math-tag Disambiguation Alg.
• More accurate if
  ◦ R is limited to docs of the same class as D
  ◦ Contexts are embedded with terms
  ◦ KNN is used instead of 1NN in step 3.ii

• More accurate if
  ◦ R is limited to docs in the same class as D
  ◦ Larger contexts are embedded with terms
  ◦ KNN is used instead of 1NN in step (4)
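The KNN-versus-1NN remark can be illustrated with a small nearest-neighbor vote. This is only a sketch of the general idea under stated assumptions, not the algorithm from the slides: `reference` stands in for terms from R that already carry definite tags, Euclidean distance and majority voting are assumed choices, and all vectors and tag names are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_tag(query, reference, candidate_tags, k=3):
    """Pick, among the tentative tags, the majority tag of the k nearest reference terms."""
    pool = [(vec, tag) for vec, tag in reference if tag in candidate_tags]
    nearest = sorted(pool, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference terms (context embedding, definite tag) for a prime accent.
reference = [
    ([0.7, 0.3], "derivative"),
    ([0.8, 0.2], "derivative"),
    ([0.6, 0.4], "complement"),   # an outlier sitting among the derivative examples
    ([0.2, 0.8], "complement"),
    ([0.1, 0.9], "complement"),
]

query = [0.55, 0.45]                 # embedding of the term to disambiguate
tentative = {"derivative", "complement"}

tag_1nn = knn_tag(query, reference, tentative, k=1)
tag_3nn = knn_tag(query, reference, tentative, k=3)
```

With k = 1 the single outlier decides the tag, while with k = 3 it is outvoted, which is one reason a KNN vote can be more accurate than 1NN.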

Retrospective and Prospective
• The previous approach requires tentative tags and KNN search
• Can we do better than KNN?
  ◦ Possibly: train traditional classifiers (NN, SVM, RF) to classify each term
    ▪ the classes are the possible tag values
    ▪ the feature vectors are n-gram embeddings
• Prospective:
  ◦ Can a model be trained to find the definite tags directly?
  ◦ Answer: Probably, using Recurrent Neural Networks
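The classifier idea above (classes are tag values, features are n-gram embeddings) can be sketched with the simplest trainable model, a multi-class perceptron. It stands in here for the NN, SVM, or RF classifiers named on the slide; the 2-d feature vectors, the tag names, and the function names are hypothetical.

```python
def score(weight_row, x):
    """Dot product of one tag's weights with the bias-augmented input."""
    return sum(w * xi for w, xi in zip(weight_row, x))


def train_perceptron(samples, tags, epochs=20):
    """Learn one weight vector per tag; the last weight of each row acts as a bias."""
    dim = len(samples[0][0])
    weights = {tag: [0.0] * (dim + 1) for tag in tags}
    for _ in range(epochs):
        for vec, gold in samples:
            x = vec + [1.0]
            predicted = max(tags, key=lambda t: score(weights[t], x))
            if predicted != gold:
                # Move the gold tag's weights toward x and the wrong tag's away.
                for i, xi in enumerate(x):
                    weights[gold][i] += xi
                    weights[predicted][i] -= xi
    return weights


def predict(weights, vec):
    """Return the tag whose weight vector scores the input highest."""
    x = vec + [1.0]
    return max(weights, key=lambda t: score(weights[t], x))


# Hypothetical n-gram embeddings of superscript contexts, labeled with their tags.
samples = [
    ([1.0, 0.0], "power"),
    ([0.9, 0.2], "power"),
    ([0.0, 1.0], "index"),
    ([0.1, 0.8], "index"),
]

weights = train_perceptron(samples, ["power", "index"])
new_tag = predict(weights, [0.95, 0.1])
```

Unlike KNN, a trained classifier does not need to search the reference set at prediction time; it only evaluates the learned weights.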

Recurrent Neural Networks (RNNs)
• The input is a variable-length sequence
  ◦ Sentence, phrase, n-gram
  ◦ Equation, math expression
  ◦ Note: each entity in the sequence is represented by its embedding
• RNN characteristics
  ◦ Remembers something (a state) derived from the subsequence thus far
  ◦ Incorporates that state in the processing of the next term in the sequence

Schematic of RNNs
• Input: X(1), X(2), …, X(n)
• Output: O(1), O(2), …, O(n)
• t: term index (time)
• Each RNN cell (the big circle) has 2 inputs and 2 outputs
• Hidden states: h(t) = σ(U·X(t) + W·h(t−1))
• Output: O(t) = σ(V·h(t)) or O(t) = softmax(V·h(t))
• U, V, and W are parameter matrices optimized through training
• σ is the sigmoid
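The recurrence above can be written out directly. The sketch below implements h(t) = σ(U·X(t) + W·h(t−1)) and O(t) = softmax(V·h(t)) in plain Python; the matrix values, the dimensions, and the zero initial state `h0` are assumptions for illustration, since in practice U, V, and W come from training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    peak = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_forward(xs, U, W, V, h0):
    """Run the cell over the sequence xs; return every O(t) and the final state."""
    h = h0
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [sigmoid(p) for p in pre]        # h(t) = sigma(U.X(t) + W.h(t-1))
        outputs.append(softmax(matvec(V, h)))  # O(t) = softmax(V.h(t))
    return outputs, h


# Hypothetical parameters: 2-d inputs, 2-d hidden state, 3 output classes.
U = [[0.5, -0.2], [0.1, 0.4]]
W = [[0.3, 0.0], [-0.1, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # a 3-term input sequence
outputs, final_state = rnn_forward(xs, U, W, V, h0=[0.0, 0.0])
```

Each O(t) is a probability distribution over classes, so tagging a term amounts to taking the most probable class at its position.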

LSTM Cells
• More elaborate than RNN cells
• Has 3 inputs, 2 outputs
• Has gates to control how much
  ◦ of the previous cell's outputs to use
  ◦ of its output to go to the next cell
• Let Y(t) = [h(t−1), X(t)]
• O(t) = O(t−1) * σ(Wf·Y) + σ(Wi·Y) * tanh(W·Y)
• h(t) = O(t) * σ(Wo·Y)
• Wf, Wi, Wo are parameter matrices optimized by training
[Figure: LSTM cell t. Inputs: O(t−1) (output of cell t−1), h(t−1) (state of cell t−1), and X(t). Outputs: O(t) and h(t).]
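One step of the cell can be sketched by following the slide's equations literally, with `*` as the element-wise product and Y as the concatenation of h(t−1) and X(t). The function and variable names are hypothetical, and the matrices would normally be learned. Note that the widely used LSTM formulation additionally applies tanh to the cell output before the last gate and includes bias terms; the sketch keeps the simplified form shown on the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lstm_step(o_prev, h_prev, x, Wf, Wi, W, Wo):
    """One cell step, following the slide's equations term by term."""
    y = h_prev + x                                       # Y(t) = [h(t-1), X(t)]
    forget = [sigmoid(z) for z in matvec(Wf, y)]         # sigma(Wf.Y)
    write = [sigmoid(z) for z in matvec(Wi, y)]          # sigma(Wi.Y)
    candidate = [math.tanh(z) for z in matvec(W, y)]     # tanh(W.Y)
    # O(t) = O(t-1) * sigma(Wf.Y) + sigma(Wi.Y) * tanh(W.Y)
    o = [op * f + i * c for op, f, i, c in zip(o_prev, forget, write, candidate)]
    # h(t) = O(t) * sigma(Wo.Y)
    h = [ok * sigmoid(z) for ok, z in zip(o, matvec(Wo, y))]
    return o, h


# Hypothetical sizes: hidden size 2, input size 1, so Y has 3 components.
Wf = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
Wi = [[0.1, 0.1, 0.1], [-0.3, 0.2, 0.5]]
W = [[0.5, -0.5, 0.2], [0.1, 0.0, -0.4]]
Wo = [[0.3, 0.3, 0.0], [0.0, -0.2, 0.6]]

o_t, h_t = lstm_step(o_prev=[0.0, 0.0], h_prev=[0.0, 0.0], x=[1.0],
                     Wf=Wf, Wi=Wi, W=W, Wo=Wo)
```

The three inputs and two outputs of the function match the "3 inputs, 2 outputs" of the cell: O(t−1), h(t−1), and X(t) go in, and O(t) and h(t) come out.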

LSTM without Attention
• Each word is represented as an embedding vector
• The bottom state h (a vector) encodes the entire sentence
• The 2nd column of LSTM cells decodes h into a new sequence
• Every decoding LSTM cell takes h and the previous decoder state as input to decode the next word

LSTM with Attention
• Each attention model computes a weighted average of the encoder states
• By carefully adjusting the weights, the attention model decides which of the encoder states to focus on
• The weights, at decoder time t, are controlled by the previous state of the decoder, h′(t−1)
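The weighted average described above can be sketched in a few lines. The slide does not say how the weights are derived from the previous decoder state, so this sketch assumes one common choice: dot-product scores passed through a softmax. The function names and the toy states are hypothetical.

```python
import math


def softmax(z):
    peak = max(z)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(encoder_states, prev_decoder_state):
    """Return the attention weights and the weighted average of the encoder states."""
    # Assumed scoring: dot product of each encoder state with h'(t-1).
    scores = [sum(a * b for a, b in zip(h, prev_decoder_state))
              for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(size)]
    return weights, context


# Hypothetical encoder states for a 2-term input.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]

# A decoder state aligned with the first encoder state focuses attention on it.
focused_weights, focused_context = attention(encoder_states, [10.0, 0.0])

# A zero decoder state gives uniform weights, i.e. a plain average.
uniform_weights, uniform_context = attention(encoder_states, [0.0, 0.0])
```

The resulting context vector replaces the single fixed sentence encoding h used by the decoder in the attention-free design.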

Use of LSTM with Attention (LSTM-wA) in this Project
• Input: Math expressions and equations
• Output: The tags / role / meaning of each input term, OR
• Output: cMML, or CAS program, or formal math
• The training: Needs large datasets of labeled documents
• Datasets:
  ◦ The DLMF, with labels produced from current annotations
  ◦ A lot more documents (arXiv), labeled by community efforts?
• To the public: the labeled dataset + the trained models

Summary (1)
• Math2Vec embeddings of terms, expressions, equations, and n-grams
  ◦ Are good algebraic/numerical representations
  ◦ Good for search and uncovering relations between math entities
  ◦ Can be used for clustering
  ◦ Can be used to train classifiers for various tasks, such as
    ▪ tag disambiguation
    ▪ document classification

Summary (2)
• RNNs, especially BiLSTMs with Attention
  ◦ Can directly tag each input term in a math expression
  ◦ Can directly translate to cMML/CAS/formal math

Summary (3)
• What this project does and will do
  ◦ Collect large datasets of math documents
  ◦ Label some of those documents
  ◦ Adapt and train embedders for math
  ◦ Compute embeddings of math terms
  ◦ Use embeddings for tag disambiguation
  ◦ Adapt and train RNNs with attention to directly tag math terms
  ◦ Evaluate and optimize performance of those models
  ◦ Make available to the public: labeled datasets, embeddings, trained models, software