
Information Query Formulation in a Slavonic Language and its Automatic Processing
Experience from Polish and Czech in comparison to Western European Languages
Petr Strossa
University of Economics, Prague
Department of Information & Knowledge Engineering
TEL-ME-MOR/M-CAST Seminar, 2006

General Issue
86 Question/Answer Types and the basic idea of their recognition in texts
[D. Laurent et al., SYNAPSE, Toulouse]

Technology
Priberam’s lexicon data structure
SintaGest software tool
[Priberam Informática, Lisbon]

Question-Answer Pattern (Example)
Question(WEIGHT)
: Root("jaký")? Dist(0, 5) WeightNoun = 20 // Jaká je hmotnost Země? (What is the mass of the Earth?)
: Wrd(jak) WeightAdj = 20 // Jak těžký může být slon? (How heavy can an elephant be?)
: Wrd(kolik) WeightUnit = 20 // Kolik kg má dospělý kapr? (How many kg does an adult carp weigh?)
: Wrd(kolik) Root("vážit") = 20 // Kolik váží kapr? (How much does a carp weigh?)
Answer
: WeightNoun Definition With Pivot Dist(0, 5) {Number 6 WeightUnit} = 20 // Váha kapra může dosáhnout až 5 kg. (The weight of a carp can reach up to 5 kg.)
: Pivot Dist(0, 5) Cat(V) Dist(0, 5) {Number 6 WeightUnit} = 20 // Roční kapr může dosáhnout 5 kg tělesné váhy. (A one-year-old carp can reach 5 kg of body weight.)
;
Answer(WEIGHT)
: Number 6 WeightUnit = 20
;

Definitions of Constants Used in the Previous Example
Const WeightNoun = AnyRoot(hmotnost, hmota, "tíha", "váha", "zatížení");
Const WeightAdj = AnyRoot("těžký", "lehký");
Const WeightUnit1 = AnyRoot(mikrogram, miligram, centigram, decigram, dekagram, hektogram, kilo, cent, megagram, miligram, tuna, "karát", pond, kilopond, megapond, libra);
Const WeightUnit2 = AnyWrd(mg, cg, dg, g, dag, deka, Dg, dkg, hg, kg, q, Mg, t, p, kp, Mp, lb, "lb.", lbs, "lbs.", cwt, "cwt.");
Const WeightUnit = AnyConst(WeightUnit1, WeightUnit2);
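As a rough illustration of how the two simplest patterns could be applied, the following Python sketch checks a question for the word kolik directly followed by a weight-unit abbreviation, and pulls number-plus-unit pairs out of a candidate answer sentence. It is not the Priberam engine: the tokenizer, the function names, and the reading of neighbouring pattern elements as neighbouring tokens are all assumptions made for this example. Patterns that use Root(...) are left out because they would need a lemmatizer.

```python
import re

# Unit abbreviations taken from the unit list on the slide (trailing dots
# dropped, because the simple tokenizer below discards punctuation).
WEIGHT_UNITS = {
    "mg", "cg", "dg", "g", "dag", "deka", "Dg", "dkg", "hg", "kg", "q",
    "Mg", "t", "p", "kp", "Mp", "lb", "lbs", "cwt",
}

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def tokenize(text):
    """Hypothetical tokenizer: returns numbers and alphabetic words only."""
    return re.findall(r"\d+(?:[.,]\d+)?|[^\W\d_]+", text)


def is_weight_question(question):
    """Assumed reading of the 'kolik + unit' pattern: adjacent tokens."""
    tokens = tokenize(question)
    return any(
        first.lower() == "kolik" and second in WEIGHT_UNITS
        for first, second in zip(tokens, tokens[1:])
    )


def find_weight_answers(sentence):
    """Assumed reading of the answer pattern: a number directly before a unit."""
    tokens = tokenize(sentence)
    return [
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if NUMBER.fullmatch(first) and second in WEIGHT_UNITS
    ]


print(is_weight_question("Kolik kg má dospělý kapr?"))
print(find_weight_answers("Váha kapra může dosáhnout až 5 kg."))
```

A question such as "Kolik váží kapr?" is deliberately not recognized by this sketch, since matching it depends on reducing váží to the root vážit.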

General Observation
• The approach and the tools designed to process Western European languages can be adapted to process Slavonic languages such as Polish and Czech.
• Some basic differences between the language families must be kept in mind during such an adaptation!

The Abundance of Morphology
• Nouns: 4 (!) genders, 2 numbers, 7 cases
• Adjectives: e.g. světlý (bright)
– 3 degrees: světlý ↔ světlejší, nejsvětlejší
– 4 genders: světlý ↔ světlá, světlé
– 2 numbers: světlý ↔ světlí
– 7 cases: světlý ↔ světlého, světlému, ...

The Abundance of Morphology (2)
• Adjectives continued:
– Theoretically, every adjective may have 3 * 4 * 2 * 7 = 168 forms altogether!
– In practice, some of them are always identical (without exception)...
– A general scheme for describing a morphological pattern cannot work with fewer than 57 forms (= 3 degrees * 19 possibly differing gender/number/case endings).
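The two counts on this slide can be checked with a short Python sketch. The category labels are filled in from the other slides (degrees, the four genders including masculine animate and inanimate, the seven cases); the variable names are chosen for this example only.

```python
from itertools import product

DEGREES = ["positive", "comparative", "superlative"]
GENDERS = ["masculine animate", "masculine inanimate", "feminine", "neuter"]
NUMBERS = ["singular", "plural"]
CASES = ["Nom.", "Gen.", "Dat.", "Acc.", "Voc.", "Loc.", "Instr."]

# Every theoretically possible combination of grammatical categories.
theoretical_slots = list(product(DEGREES, GENDERS, NUMBERS, CASES))

# The slide states that 19 endings can differ within one degree.
DISTINCT_ENDINGS_PER_DEGREE = 19
pattern_slots = len(DEGREES) * DISTINCT_ENDINGS_PER_DEGREE

print(len(theoretical_slots))  # 168
print(pattern_slots)           # 57
```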

The Abundance of Morphology (3): Illustration – the 19 Ending System

The Abundance of Morphology (4)
• Adjectives continued:
– In fact, not all adjectives have all the forms.
– Some adjectives cannot undergo gradation for purely morphological reasons: domácí (home, home-made)
– Other adjectives usually do not undergo gradation for semantic reasons: jednofázový (one-phase)

Morphological Pattern (Ex. 1)
         Sg.        Pl.
Nom.     babk-a     babk-y
Gen.     babk-y     babek
Dat.     babc-e     babk-ám
Acc.     babk-u     babk-y
Voc.     babk-o     babk-y
Loc.     babc-e     babk-ách
Instr.   babk-ou    babk-ami

Morphological Pattern (Ex. 2)
         Sg.         Pl.
Nom.     <S>-a       <S>-y
Gen.     <S>-y       <S+E>
Dat.     <S+S0>-e    <S>-ám
Acc.     <S>-u       <S>-y
Voc.     <S>-o       <S>-y
Loc.     <S+S0>-e    <S>-ách
Instr.   <S>-ou      <S>-ami
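A minimal Python sketch of how such an abstract pattern might be applied to a concrete stem is given below. The slides do not define the stem-alternation slots, so two readings are assumed here, judging from the babka example: the softened slot turns a final k into c (babk → babc), and the E slot inserts e before the last consonant (babk → babek). The data layout and function names are hypothetical.

```python
def soften(stem):
    """Assumed softening rule: a final k becomes c (babk -> babc)."""
    return stem[:-1] + "c" if stem.endswith("k") else stem


def insert_e(stem):
    """Assumed e-insertion rule: e goes before the last letter (babk -> babek)."""
    return stem[:-1] + "e" + stem[-1]


PLAIN, SOFT, WITH_E = "S", "S+S0", "S+E"

STEM_VARIANTS = {PLAIN: lambda stem: stem, SOFT: soften, WITH_E: insert_e}

# (stem variant, ending) per case, in the order Nom Gen Dat Acc Voc Loc Instr.
PATTERN = {
    "sg": [(PLAIN, "a"), (PLAIN, "y"), (SOFT, "e"), (PLAIN, "u"),
           (PLAIN, "o"), (SOFT, "e"), (PLAIN, "ou")],
    "pl": [(PLAIN, "y"), (WITH_E, ""), (PLAIN, "ám"), (PLAIN, "y"),
           (PLAIN, "y"), (PLAIN, "ách"), (PLAIN, "ami")],
}


def decline(stem):
    """Generate all 14 forms of a noun that follows this pattern."""
    return {
        number: [STEM_VARIANTS[variant](stem) + ending for variant, ending in slots]
        for number, slots in PATTERN.items()
    }


forms = decline("babk")
print(forms["sg"])
print(forms["pl"])
```

The same table then serves any other noun of the same type, for example the stem matk (matka, mother), which is the point of storing patterns rather than full paradigms.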

Morphology of Nouns: Some Statistics
NUMBER OF PATTERNS    NUMBER OF NOUNS FOLLOWING THEM    PERCENTAGE
  2                    85 000                            70
 11                   110 000                            90
 19                   114 000                            95
 27                   116 000                            97
 56                   118 000                            99
292                   120 000                           100
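The percentage column can be recomputed from the noun counts, as in the Python sketch below. It assumes that the last row (120 000 nouns at 100 %) is the size of the whole noun lexicon; since the noun counts themselves look rounded, the recomputed values agree with the listed ones only approximately.

```python
# (number of patterns, nouns covered, percentage as listed on the slide)
ROWS = [
    (2, 85_000, 70),
    (11, 110_000, 90),
    (19, 114_000, 95),
    (27, 116_000, 97),
    (56, 118_000, 99),
    (292, 120_000, 100),
]

# Assumption: the final row covers the entire noun lexicon.
TOTAL_NOUNS = ROWS[-1][1]


def coverage_percent(nouns_covered, total=TOTAL_NOUNS):
    """Share of the lexicon covered, in percent."""
    return 100 * nouns_covered / total


for patterns, nouns, listed in ROWS:
    computed = coverage_percent(nouns)
    print(f"{patterns:>3} patterns: {computed:5.1f} % computed, {listed} % listed")
```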

Morphology of Nouns: Some Statistics (2)
• We need about 300 noun patterns altogether.
• We have about 90 noun patterns that describe the declension of at least 10 different nouns.
• We have about 80 noun patterns that describe only 1 noun each.
• About one half of the noun patterns describe the declension of 1–3 nouns each.

Inherent Homonymy of Forms
• A typical situation for our type of morphology: světlé (bright)
– nominative/accusative/vocative singular neuter
– genitive/dative/locative singular feminine
– nom./acc./voc. plural feminine
– acc. plural masculine animate
– nom./acc./voc. plural masculine inanimate
i.e. 13 possible grammatical interpretations altogether!
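The count of 13 follows from expanding each listed group into individual case readings. A small Python sketch, with a data layout invented for this example:

```python
# Readings of the single form "světlé" as grouped on the slide:
# (cases, number, gender)
READING_GROUPS = [
    (("nominative", "accusative", "vocative"), "singular", "neuter"),
    (("genitive", "dative", "locative"), "singular", "feminine"),
    (("nominative", "accusative", "vocative"), "plural", "feminine"),
    (("accusative",), "plural", "masculine animate"),
    (("nominative", "accusative", "vocative"), "plural", "masculine inanimate"),
]


def expand_readings(groups):
    """Turn grouped entries into one (case, number, gender) triple per reading."""
    return [
        (case, number, gender)
        for cases, number, gender in groups
        for case in cases
    ]


readings = expand_readings(READING_GROUPS)
print(len(readings))  # 13
```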

Inherent Homonymy of Forms (2)
• An only slightly less typical situation: Ženu holí stroj.
– I am setting a machine in motion with a stick.
  • OR: I am setting a machine of sticks in motion. (*)
– The woman is shaved by a machine.
– Dress the woman with a stick.
  • OR: Dress the woman of sticks. (*)

Inherent Homonymy of Forms (3)
• All of the above once again, this time in a question: Jaký je plat Petra Hanka?
– What is the salary of XY?
• X ∈ {Petr, Peter, Petar}
• Y ∈ {Hank, Hanek, Hanke, Hanko}
• The only thing we know for sure: X ≠ Petra (though such a name exists); Y ≠ Hanka (though such a name exists)!
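One practical consequence is that a system looking for the person behind the genitive form "Petra Hanka" has to consider every combination of the candidate base forms. The Python sketch below simply enumerates them from the two sets given on the slide; how a real system would derive those sets from the genitive is not shown here.

```python
from itertools import product

# Candidate nominative forms as listed on the slide.
FIRST_NAME_CANDIDATES = ["Petr", "Peter", "Petar"]
SURNAME_CANDIDATES = ["Hank", "Hanek", "Hanke", "Hanko"]

candidate_names = [
    f"{first} {last}"
    for first, last in product(FIRST_NAME_CANDIDATES, SURNAME_CANDIDATES)
]

print(len(candidate_names))  # 12
print(candidate_names[:4])
```

None of the 12 candidates is "Petra Hanka" itself, which matches the one certainty stated on the slide.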

Inherent Homonymy of Forms (4)
Jaký je plat Petra Hanka? – What is the salary of XY?
• The only thing we know for sure: X ≠ Petra (though such a name exists); Y ≠ Hanka (though such a name exists)!
Compare: Jaký plat Hanka dává svým zaměstnancům? – What salary does Hanka give to her/his employees?

Inherent Homonymy of Forms (Conclusion)
• Due to our free word order, disambiguation based on a limited context is generally quite problematic.
• A really safe disambiguation is possible only after a complete syntactic analysis of a sentence (which should keep all the possible meanings of all the words up to the end).
– (But we do not make a complete syntactic analysis of sentences in M-CAST.)

Free Word Order Again
• How far is it to Brno?
– Jak daleko je do Brna?
– Jak je daleko do Brna?
– Jak je do Brna daleko?
– Do Brna je jak daleko?
– Do Brna jak je daleko?
– Do Brna je daleko jak?
– Daleko je do Brna jak?
(+++) (++) (+) (+)
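To give a sense of scale, the Python sketch below treats the question as four movable chunks (jak, daleko, je, do Brna) and enumerates all their orderings. The chunking is an assumption made for this example, and the sketch makes no claim about which of the 24 orderings are acceptable Czech; it only confirms that the seven variants on the slide are orderings of the same four chunks, which is why a pattern tied to one fixed order would miss most of them.

```python
from itertools import permutations

# Assumed chunking: the prepositional phrase "do Brna" moves as one unit.
CHUNKS = ("jak", "daleko", "je", "do brna")

SLIDE_VARIANTS = [
    "Jak daleko je do Brna?",
    "Jak je daleko do Brna?",
    "Jak je do Brna daleko?",
    "Do Brna je jak daleko?",
    "Do Brna jak je daleko?",
    "Do Brna je daleko jak?",
    "Daleko je do Brna jak?",
]


def normalize(question):
    """Lower-case the question and drop the final question mark."""
    return question.rstrip("?").lower()


all_orders = {" ".join(order) for order in permutations(CHUNKS)}
listed_orders = {normalize(variant) for variant in SLIDE_VARIANTS}

print(len(all_orders))               # 24
print(len(listed_orders))            # 7
print(listed_orders <= all_orders)   # True
```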