Electronic dictionaries Duko Vitas University of Belgrade Faculty
Electronic dictionaries Duško Vitas University of Belgrade, Faculty of Mathematics 1
One definition of a (traditional) dictionary o A dictionary is a book in which the words and phrases of a language are listed alphabetically, together with their meanings or their translations in another language. (Collins Cobuild, English Dictionary for Advanced Users) 2
From Dictionary. com o. . . a book, optical disc, mobile device, or online lexical resource containing a selection of the words of a language, giving information about their meanings, pronunciations, etymologies, inflected forms, derived forms, etc. , expressed in either the same or another language. . . Print dictionaries of various sizes, ranging from small pocket dictionaries to multivolume books, usually sort entries alphabetically. . . All electronic dictionaries, whether online or installed on a device, can provide immediate, direct access to a search term, its meanings. . . 3
Some definitions of e-dictionaries o According to one definition: (Schryver, 2003) “any dictionary that can be used in an automated environment” o The other definition says (Jacquet-Pfau, 2002) that “electronic dictionary intended for automated processing of texts (corpora) differ from machine-readable dictionaries that are intended to human users” 4
Characteristics of e-dictionaries o E-dictionaries have to fulfill two basic criteria: n They have to be formally established so that computer programs can process them; besides that, e-dictionaries complement grammars as all exceptions are listed in them. n E-dictionaries have to be exhaustive since they have to cover 100% of lexica of a language in question; a parser that processes a text should not be impeded by unknown words. This aim is difficult to achieve. n As opposed to grammars, an e-dictionary tends to desribe extensively lexical properties of lemmas. 5
Development of e-dictionaries o Can development of an e-dictionary rely on some excellent traditional dictionary? n Traditional dictionaries are often limited in size (e. g. for commercial reasons); n Information in them is often implicit – they rely on the belief that a human will easily supply all missing data, for instance, a human will correctly deduce a whole paradigm if offered one or two inflective endings. n Information is often partial (e. g. in Serbian a noun otac has two possible plural forms očevi, oci; for automatic processing it is necessary to explicitly know whether it is possible to say: ? očevi nacije ‘national founding fathers’, ? oci dece ‘fathers of the childrens’, even Očevi i oci (title of a novel) 6
From a list of words to an edictionary o Many computer scientists in the past thought that a list of words taken from a traditional dictionary is good starting point for the development of an e-dictionary. o This attitude was influenced by work done for English which is not a typical example of an European language (from the point of view of the automatic processing, because of its modest inflection). o Before one should start to develop an e-dictionary it should be clear what is going to be its basic unit (lemma), and then how its other forms can be generated from it. 7
Defining a basic unit of an edictionary o Automatic text processing usually begins with simple words as basic units of texts. This is a natural starting point because they are formalized for most of European languages. However, simple words are not always a natural unit of processing, because they are: n ambiguous (dictionaries offer for them several meanings); n pointless (many terms have several constituents, and each of them does not contribute directly to the meaning of a term) o Because of that dictionaries of simple words have to be complemented with other types of dictionaries and grammars that will provide a natural units of processing. 8
Types of e-dictionaries o E-dictionaries of simple words (dictionaries of simple graphemic units– these are usually entries in traditional dictionaries); o E-dictionaries of multi-word units (multi-word units that contain non-letter characters, terminology, collocations, phrases, . . . ); o Phonological e-dictionaries (pronunciation of simple words, with rules of how to pronounce inflected forms, words in contact, etc. ); o Semantic e-dictionaries (simple words and multi-word units with encoded senses – network of senses? ) 9
A prerequisite for the development of an e-dictionary of simple words o A selection of lexical categories and a way to represent them. Traditional categories: n Part-of-Speech: noun, verb, adjective, . . . n subordinated categories: possessive, indefinite, . . . n inflectional categories: masculine, feminine, neuter, nominative, genitive, . . . n syntactic categories: transitive, intransitive, . . . n semantic categories: human, abstraction, concrete object, . . . 10
The selection of categories is not a straightforward task o A sat of tags used to annotate the Brown corpus Brown o A sat of tags used to annotate the Penn tree bank Penn o A sat of tags used for the Multext-East project Multext-East 11
LADL format of electronic dictionaries o Unitex works with dictionaries that were developed by members of the Relex network. o Relex is an international informal network of laboratories that work on computational linguistics. It was established by Maurice Gross and his LADL team. (LADL is shortened for Laboratoire d'Automatique Documentaire et Linguistique) o Members of the Relex network developed exhaustive edictionaries of simple words and compounds (http: //infolingu. univ-mlv. fr/Relex. html) 12
A selection of canonic forms o In the case that a word has several surface forms, one of them is chosen as a canonic representative for other, subordinate forms. o What are canonic forms in e-dictionaries of French? n For nouns, as a rule that is the singular masculine form; n For verbs, that is the infinitive form. . . 13
Is the selection of a canonic form unique? o It is neither simple nor unique. For instance, n In French, the gender of nouns is an inflectional category, that is lecteur, lecteurs, lectrices are four forms of the same word – its canonic form is lecteur n In Serbian, the gender of nouns is not an inflectional category; so, učitelj and učiteljica (traditionally, as well as in the Serbian e-dictionary) are two canonic forms, each with its own subordinated forms. n Similarly in Bulgarian: учител and учителка 14
Why is the adequate selection of an canonic form so important? o A lot of information about a word is attached to its canonic form – all subordinate forms share that information: n učiteljica has semantic features +Hum+Prof and the same features have all its inflected forms: učiteljice, učiteljici, učiteljicu, . . . o Is this a rule that release us from further from making other decisions? No, in Serbian the gender and the animacy are features of subordinate forms, not canonic forms. Why? n Nouns can change gender in plural forms, vladika (m) vladike (f) 15
More about categories attached to canonic and subordinate forms o First, there was a mouse n n This mouse is alive Its canonic form is miš, N+Zool o Then came a mouse n n This mouse is not alive Its canonic form is miš, N+Conc o What is the value of the grammatical category “animacy” for this new mouse? Google: 19, 000 Da biste se prebacili na sledeće poglavlje DVD-a, pomerite miša Da biste kontrolisali reprodukciju televizije uživo, pomerite miš kako bi se prikazale kontrole za reprodukciju Google: 28, 300 16
More on the selection of a canonic form o Passive past participles are not separate entries in Serbian traditional dictionaries – these forms belong to the verb paradigm. o What about passive past participles that are used as adjectives? A program for automatic text processing has to recognize them somehow and to tag them appropriately. n For instance, a sample of “Politika” having 582, 000 simple word tokens contains only in the feminine gender accusative 228 adjectives derived from the past participle (they are not all correct) 17
And what about. . . o Present past participle (functioning as an adjective) – “Politika” – occurrences in the feminine accusative singular forms; o Present gerund (functioning as an adjective) – Politika – occurrences in the feminine accusative singular forms o Derivational forms n n Possessive adjectives – dečakov, partizanov, . . . Diminutives – tkaninica, futrolica, telefonče, . . . Gender motion – druidica, gutačica, gudačica, guvernerka, . . . Šezedestogodišnjakinja, četvoroipomesečni, dvestopedestogodišnjica, . . . o They all have in Serbian e-dictionary separate canonic forms, each with its own subordinate forms. 18
In order to obtain (close to) 100% coverage of a text, it is necessary to include: o colors – skerletnocrven, bledoplav, mlečnožut, . . . o Proper names – n n n personal names, geopolitical names organizations (Ozna, Gestapo, Metropoliten, . . . ) objects – trademarks (lajka, spitfajer, mercedes, . . . ) Titles and characters of novels, films, operas… (Dezdemona, Asteriks, Plavobradi. . . ) n events (Anšlus, . . . ) o And then also – donžuanstvo, arsenlupenovski, neotitoizam, nedićevština, . . . 19
Details of the LADL format o There are two dictionaries (or lists) of simple words in the LADL format: n First dictionary – DELAS – is a dictionary of canonic forms (lemmas). This dictionary is used to generate the second dictionary. n Second dictionary – DELAF – is a dictionary of subordinate (or inflected) forms. Only this dictionary is used in the automatic text processing. 20
An entry in a DELAS dictionary lemma, Kn+Prop o K: A Part-of-Speech code; n Usually that is a code consisting of one or more upper-case letters. o n: A relation with subordinate forms, if they exist; n Usually that is an alphanumeric code that together with a Po. S code enables the generation of all subordinate forms for a DELAF dictionary. o Prop: Syntactic, semantic, dialect, usage, domain, … markers n Markers that can be freely attached to any canonic form – they are in a 21 form of alphanumeric codes.
An example of a DELAS entry from the Serbian e-dictionary učiteljica, N 651+Hum+GM n učiteljica n N n (N)651 n +Hum n +GM canonic form (lemma) Part-of-Speech (noun) Inflection class code used to generate all inflected forms human feminine gender noun derived from the corresponding masculine gender noun učitelj 22
Examples of Serbian DELAS entries for various Po. S o o o o o učiteljica, N 651+Hum+GM zagasitocrven, A 6+Col smejati, V 516+Imperf+It+Ref+Ek ćutke, ADV deset, NUM+v 5 poneko, PRO+Pro. N+Indef+Sr ali, CONJ od, PREP+p 2 jaoj, INT naime, PAR 23
An example of a DELAS entry from the Bulgarian e-dictionary глава, C 600+Ж n глава n N n (N)600 canonic form (lemma) Part-of-Speech (noun) Inflection class code used to generate all inflected forms n +Ж feminine 24
An entry in a DELAF dictionary: word form, lemma. K+Prop(: gc)* n Canonic form (or lemma); n K: A Part-of-Speech code (inherited from its lemma) n Prop: Syntactic, semantic, dialect, usage, domain, … markers (inherited from its lemma) n gc: A set of codes that represent values of grammatical categories describing a form: o Grammatical categories depend on the Po. S; o These are one character alphanumeric codes. 26
An example of a DELAF entry from the Serbian e-dictionary učiteljicu, učiteljica. N+Hum+GM: fs 4 v n n n učiteljicu učiteljica N +Hum+GM fs 4 v o f o s o 4 o v subordinate form (realization) canonic form (lemma) Po. S (inherited from the canonic form) markers (inherited from the canonic form) values of grammatical categories: category gender (value feminine) category number (value singular) category case (value accusative) category animacy (value animate) 27
The whole paradigm of the lemma učiteljica o o o o učiteljica, učiteljica. N: fp 2 v učiteljica, učiteljica. N: fs 1 v učiteljice, učiteljica. N: fp 5 v učiteljice, učiteljica. N: fp 4 v učiteljice, učiteljica. N: fp 1 v učiteljice, učiteljica. N: fs 5 v učiteljice, učiteljica. N: fw 4 v učiteljice, učiteljica. N: fw 2 v o o o o učiteljice, učiteljica. N: fs 2 v učiteljici, učiteljica. N: fs 7 v učiteljici, učiteljica. N: fs 3 v učiteljicu, učiteljica. N: fs 4 v učiteljicom, učiteljica. N: fs 6 v učiteljicama, učiteljica. N: fp 7 v učiteljicama, učiteljica. N: fp 6 v učiteljicama, učiteljica. N: fp 3 v The numeric code 651 that connects a canonic form with all of its subordinate forms is deleted because it is of no use anymore. 28
How is relation between canonic form and its subordinate form (inflected forms) established? o In Unitex system Finite State Transducers – FST – are used for this. o Inflection class code used that follows Po. S code in DELAS (dictionary of lemmas) is used to generate all inflected forms. o One transducer is usually used to generate forms for many lemmas. For instance, transducer N 2 generates inflected forms for: emir, evrofil, dijetetičar, forenzičar, leptir, šegrt, and many other lemmas. 31
FST defines classes (BG) o In most of the languages the relation between a lemma and its forms is an intuitive relation of equivalence that is formalized, in the case of the LADL format by FSTs, n син/sg, indef n синове/pl, indef n сине/sg, voc син(<E>/sg, indef+ове/pl, indef+е/sg, voc+а/pl, count) n сина/pl, count (. . . ? . . . )(<E>/sg, indef+ове/pl, indef+е/sg, voc+а/pl, count) N 01: (<E>/sg, indef+ове/pl, indef+е/sg, voc+а/pl, count) 32
Dictionaries for other languages o Russian - developed at CIS, Munich (CISLEXRU) derived mostly from Zaliznyak, A. Grammaticheskij slovar' russkogo jazyka) and contains approximately 44, 000 lemmas (930. 000 forms) 33
Капе, Капа. N+PN+VORN+anim(o)+ge n(M)+style(colloq): de. M: qe. M Капе, Капа. N+PN+VORN+anim(o)+gen(M) +style(colloq): de. M: qe. M n n n n n Капе – word form Капа - lemma N – noun PN – proper noun VORN – given name anim(o) - animate gen(M) – masculin gender style(colloq) de. M: qe. M – dative or prepositional case, singular (e), masculin 35
Dictionaries for other languages o Polish (Z. Vetulany, Adam Mickiewicz University, 1996) marcu, marzec. N+Gi+Ns+Cl marcu, marzec. N+Gi+Ns+Cv marcu, marzec. N+month: L marynarka, . N+Gf+Ns+Cn masową, masowy. ADJ+Dp+Ns+Cai+Gf masowe, masowy. ADJ+Dp+Np+Cnav+Gaifn 36
Dictionaries for other languages Latin dictionary derived from Perseus project, based on the Lewis&Short dictionary (1879) abaculus, . N: Nms abacum, abax. N: Gmp abacum, abacus. N: Ams abacum, abacus. N+poet: Gmp abaddon, ab-addo. V: 1 si. PC abagmentum, . N: Vns abagmentum, . N: Nns abagmentum, . N: Ans 37
Comparison between L&S, Georges and Whiteker dictionaries All three dictionaries are available in e-from. o L&S supports processing on the site Perseus o Georges is available on-line o Whitaker’s Words is an application that performs morhological anaylsis But their content is different. E. g. abacinus exists in Georges and Whitaker, but not in L&S 38
What else should be known? o The use of upper-case and lower-case letters in a dictionary: n Canonic forms written with lower-case letters can match in a text both lower–case and upper-case occurrences. n Canonic forms written with (some) upper-case letters can match in a text only occurrences that use upper-case letters on that position(s). n For instance, o vlada, N 600 o Vlada, N 1741+NProp+Hum+First n some results from corpus “Politika” o vlada and Vlada 39
What else should be known, or not? o A user that will not produce a new dictionary (e. g. for a new language, or a dictionary for some sub-domain) need not know the format of DELAS dictionaries, neither he/she has to know what are inflectional transducers and how some of them look like. o A user that wants to use dictionaries for text processing needs to know what is the content of DELAF dictionaries he plans to use and what does different codes and markers mean. o Dictionary that he/she is using are compiled dictionaries (two files with the extensions. bin and. inf) and their usage by Unitex is very effective. These dictionaries cannot be “seen”. 40
E-dictionary as statistical tagger Filtering the results of word form tagging by Tnt, Tree. Tagger, etc. with edictionaries transform the results to „real“ lemmas (a part of ambiguity is lost, but the result is >95% correct : -) 41
Numbers that illustrate the content of Serbian e-dictionaries o The number of inflection transducers (April 2010) n for nouns 369 n for verbs 371 n for adjectives 66 42
- Slides: 42