Saint-Petersburg State University
TEMPLATE-DRIVEN KNOWLEDGE MINING. KNOWLEDGE PROSPECTOR.NET
Project team (Knowledge.Net): Anton V. Novikov, Maxim V. Sigalin, Alexey L. Smolyakov, Dmitry G. Cherepanov
Speaker: Alexey L. Smolyakov
Scientific adviser: Prof. Vladimir V. Safonov
Project goals
- Flexible framework
- Support for different languages
- Integration with Knowledge.Net
Algorithm
1. Getting documents and first-step text analysis
2. Morphological analysis of text blocks
3. Semantic analysis of entity sets using templates
4. Optimizing the resulting graph
5. Saving results
Getting documents and first-step text analysis
- Getting documents from providers
- Dividing a document into articles (plain text, list, table, etc.)
- Dividing text into blocks
Example document:
  Plain text: "Text format is a very flexible way to describe various types of information…"
  List: 1) One 2) Two 3) Three
  Table: Country | Capital; England | London; Ukraine | Kiev
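The splitting step can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the rule that a numbered line is a list item, and the rule that a tab-separated line is a table row are all assumptions made for the example.

```python
import re
from itertools import groupby

# Assumption: list items look like "1) ..." or "1. ..."
LIST_ITEM = re.compile(r"^\s*\d+[.)]\s+")


def classify_line(line: str) -> str:
    """Guess the article type a single line belongs to."""
    if LIST_ITEM.match(line):
        return "list"
    if "\t" in line:  # assumption: table cells are tab-separated
        return "table"
    return "text"


def split_into_articles(document: str):
    """Group consecutive non-empty lines of the same type into articles."""
    lines = [ln for ln in document.splitlines() if ln.strip()]
    return [(kind, list(group)) for kind, group in groupby(lines, key=classify_line)]


def split_into_blocks(text: str):
    """Split plain text into sentence-like blocks at ., !, ? or an ellipsis."""
    return [b.strip() for b in re.split(r"(?<=[.!?…])\s+", text) if b.strip()]
```

Each article could then be handed to the morphological step block by block; tables and lists would likely need their own handling.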
Morphological analysis of text blocks
- Language recognition
- Morphological form recognition using dictionaries
- Creating entities
Example: Word(«Documents») is looked up in the dictionaries (Russian, English, …; MRD, XML, …):
  current morphological form: «Documents», noun, plural
  base morphological form: «Document», noun, singular
  resulting entity: Class(«Document»)
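A toy version of this step is sketched below. The dictionary contents, the alphabet-based language check, and the mapping from noun to class and adjective to property are assumptions for illustration; the real system reads its morphology from external dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphForm:
    base: str            # base morphological form
    part_of_speech: str  # e.g. "Noun", "Adjective"
    number: str          # "singular", "plural" or ""


# Hypothetical in-memory stand-in for the real dictionaries.
TOY_DICTIONARIES = {
    "en": {
        "documents": MorphForm("document", "Noun", "plural"),
        "document": MorphForm("document", "Noun", "singular"),
        "useful": MorphForm("useful", "Adjective", ""),
    },
    "ru": {
        "документы": MorphForm("документ", "Noun", "plural"),
    },
}

# Assumed mapping from part of speech to entity type.
ENTITY_BY_POS = {"Noun": "Class", "Adjective": "Property"}


def recognize_language(word: str) -> str:
    """Very rough language recognition: any Cyrillic letter means Russian."""
    return "ru" if any("а" <= ch <= "я" or ch == "ё" for ch in word.lower()) else "en"


def analyze(word: str):
    """Return (entity type, base form), or ("Unknown", word) if not in the dictionary."""
    form = TOY_DICTIONARIES[recognize_language(word)].get(word.lower())
    if form is None:
        return ("Unknown", word)
    return (ENTITY_BY_POS.get(form.part_of_speech, "Unknown"), form.base)
```

With this sketch, `analyze("Documents")` yields the entity type "Class" with base form "document", mirroring the slide's example.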
Morphological analysis > Entity types > "Simple" entities
- Entity "separator". Example: «. , ; : ! ? ( ) [ ] { } …»
- Entity "unknown"
- Entity "changeable". Example: «good»
- Entity "relationship". Example: «Planet Earth is LESS than Sun»
Morphological analysis > Entity types > "True" entities
- Entity "class". Example: «document»
- Entity "property". Example: «useful»
- Entity "datatype":
  - Datetime
  - Integer
Semantic analysis > Goals
- Creating relationships between entities
- Creating new entities
- Adding true entities into the resulting graph
Diagram: Class(«house»), Class(«building»), Property(«comfortable») and Property(«brick»), linked by one «subclass» and two «property-class» relationships.
Semantic analysis > Relationship types
- Relationship between property and class
- Relationship "subclass"
- Relationship "subproperty"
- Relationship "equality"
- Relationship between two classes
- Relationship "conditional rule"
Semantic analysis > Template description
- Priority
- Pattern
- Handlers

  <Template Priority="10000" Pattern="#E.P #E.C ,? а? значить #E.P">
    <Handler Name="PropertyRelationship" Arguments="0,1" />
    <Handler Name="PropertyRelationship" Arguments="5,1" />
    <Handler Name="ConditionalRule" Arguments="1,0,5" />
  </Template>

The pattern matches Russian text: «а» ("and") and «значить» ("to mean") are literal words. Handler arguments are positions of pattern elements, counted from 0.
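A template of this shape can be loaded with a few lines of standard-library code. The sketch below assumes the element and handler names read `Template`, `Handler`, `PropertyRelationship` and `ConditionalRule` (the slide text is garbled at these points), and the `Template` and `Handler` dataclasses are hypothetical containers, not the project's own types.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

TEMPLATE_XML = """
<Template Priority="10000" Pattern="#E.P #E.C ,? а? значить #E.P">
  <Handler Name="PropertyRelationship" Arguments="0,1" />
  <Handler Name="PropertyRelationship" Arguments="5,1" />
  <Handler Name="ConditionalRule" Arguments="1,0,5" />
</Template>
"""


@dataclass
class Handler:
    name: str
    arguments: list  # positions of pattern elements, counted from 0


@dataclass
class Template:
    priority: int
    pattern: list    # whitespace-separated pattern elements
    handlers: list


def parse_template(xml_text: str) -> Template:
    """Read priority, pattern elements and handlers from one <Template> element."""
    root = ET.fromstring(xml_text)
    handlers = [
        Handler(h.get("Name"), [int(a) for a in h.get("Arguments").split(",")])
        for h in root.findall("Handler")
    ]
    return Template(int(root.get("Priority")), root.get("Pattern").split(), handlers)
```

Splitting the pattern on whitespace gives six elements here, which is consistent with the handler arguments referring to positions 0, 1 and 5.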
Semantic analysis > Pattern description
- Logical operands: «&» (and), «|» (or), «^» (not)
- Occurrence: not set (once), «+», «*», «?»
- #E.P, #E.C, #E.S, #E.U, #E.Int, #E.DateTime: entity types
- #M.Noun, #M.Adjective, #M.Verb, …: morphological forms
- #W.Month, #W.Number, …: words holders
- #H.Class, …: clauses holders
Example: [#E.P #M.Adjective]+ [#E.C #M.Noun]
Semantic analysis > Pattern description > Words holder

  <WordHolder Name="Month">
    <Item Word="JANUARY" Value="1" />
    <Item Word="FEBRUARY" Value="2" />
    <Item Word="MARCH" Value="3" />
    ...
  </WordHolder>

Clauses holder

  <ClauseHolder Name="Class">
    <Item Pattern="[#E.P #M.Adjective]* #E.C" Index="1" />
    <Item Pattern="[#E.P #M.Adjective] , [#E.P #M.Adjective] #E.C" Index="2" />
  </ClauseHolder>
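A words holder is essentially a named lookup table. The sketch below shows one way a `#W.Month` pattern element could be resolved against it; the function names are hypothetical, the element name is assumed to read `WordHolder`, and only the three months shown on the slide are included.

```python
import xml.etree.ElementTree as ET

WORD_HOLDER_XML = """
<WordHolder Name="Month">
  <Item Word="JANUARY" Value="1" />
  <Item Word="FEBRUARY" Value="2" />
  <Item Word="MARCH" Value="3" />
</WordHolder>
"""


def load_word_holder(xml_text: str):
    """Return (holder name, {WORD: value}) for one <WordHolder> element."""
    root = ET.fromstring(xml_text)
    words = {item.get("Word").upper(): int(item.get("Value")) for item in root.findall("Item")}
    return root.get("Name"), words


def match_word_holder(element: str, word: str, holders: dict):
    """Return the holder value if `word` belongs to the holder named by `#W.<Name>`, else None."""
    if not element.startswith("#W."):
        return None
    return holders.get(element[len("#W."):], {}).get(word.upper())
```

The case-insensitive comparison is a design choice of this sketch; the source does not say how the real matcher treats letter case.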
Semantic analysis > Handlers
- Replace
- Create datetime entity
- Create «property-class» relationship
- Create «subproperty» relationship
- Create «conditional rule» relationship
- Create «class-class» relationship
Semantic analysis > Creating relationships

Input entities: Property(«useful») Class(«document»)

Template:

  <Template Priority="4" Pattern="[#E.P #M.Adjective]+ [#E.C #M.Noun]">
    <Handler Name="PropertyRelationship" Arguments="0,1" />
  </Template>

Result: Property(«useful») linked to Class(«document») by a «property-class» relationship
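The way such a template might be applied can be sketched with a small backtracking matcher. Everything here is an illustrative assumption: the `Entity` fields, the reading of a bracketed group as "all conditions must hold", and the interpretation of the handler arguments as positions of pattern elements. Only bracketed elements with the occurrence marks «+», «*», «?» are supported.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # "P" (property) or "C" (class)
    pos: str   # morphological form, e.g. "Adjective", "Noun"
    text: str


ELEMENT = re.compile(r"\[([^\]]+)\]([+*?]?)")


def parse_pattern(pattern: str):
    """Turn '[#E.P #M.Adjective]+ ...' into [(conditions, occurrence), ...]."""
    return [(frozenset(conds.split()), occ) for conds, occ in ELEMENT.findall(pattern)]


def satisfies(entity: Entity, conds) -> bool:
    return conds <= {f"#E.{entity.kind}", f"#M.{entity.pos}"}


def match(elements, entities, i=0):
    """Return one (start, end) span per pattern element, or None if there is no full match."""
    if not elements:
        return [] if i == len(entities) else None
    conds, occ = elements[0]
    lo = 1 if occ in ("", "+") else 0
    hi = len(entities) - i if occ in ("+", "*") else 1
    run = 0
    while run < hi and i + run < len(entities) and satisfies(entities[i + run], conds):
        run += 1
    for n in range(run, lo - 1, -1):  # greedy first, then backtrack
        rest = match(elements[1:], entities, i + n)
        if rest is not None:
            return [(i, i + n)] + rest
    return None


def property_relationship(spans, entities, prop_index, class_index):
    """Create a «property-class» relationship for every property/class pair in the two spans."""
    p_start, p_end = spans[prop_index]
    c_start, c_end = spans[class_index]
    return [(p.text, "property-class", c.text)
            for p in entities[p_start:p_end] for c in entities[c_start:c_end]]
```

Applied to Property(«useful») followed by Class(«document»), the matcher returns one span per pattern element and the handler produces a single «property-class» relationship between the two.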
Semantic analysis > Creating new entities

Input entities: Integer(«7») Class(«December») Integer(«2006») Class(«Year»)

Template:

  <Template Priority="11000" Pattern="#E.INT #W.Month #E.INT year">
    <Handler Name="Replace" From="0" Count="4">
      <CreateEntityHandler Name="CreateDateTime" Arguments="day=0, month=1, year=2" />
    </Handler>
  </Template>

Result: Datetime(7.12.2006)
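The effect of the Replace handler with a nested datetime constructor can be sketched as below. The function names and the plain-string token representation are hypothetical; the argument meanings (`From`, `Count`, and `day`/`month`/`year` as positions inside the matched window) are an assumption read off the template attributes.

```python
from datetime import date

MONTH_NAMES = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
               "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def create_datetime(window, day=0, month=1, year=2):
    """Build a date from the matched window; the keyword values are positions in it."""
    return date(int(window[year]), MONTHS[window[month].upper()], int(window[day]))


def replace(tokens, start, count, new_entity):
    """Replace `count` tokens beginning at `start` with a single new entity."""
    return tokens[:start] + [new_entity] + tokens[start + count:]


def apply_date_template(tokens, start=0, count=4):
    """Mimic <Handler Name="Replace" From="0" Count="4"> wrapping a datetime constructor."""
    window = tokens[start:start + count]
    return replace(tokens, start, count, create_datetime(window))
```

For the slide's input, the four matched tokens collapse into one date entity for 7 December 2006.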
Optimizing the resulting graph
- Removing redundant «subclass» relationships
- Removing redundant relationships between properties and classes
Diagram: Class(«vehicle»), Class(«transport»), Class(«bus») and Property(«fast»), linked by «subclass» and «property-class» relationships.
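One plausible reading of "redundant" is sketched below: a «subclass» edge is dropped when the same superclass is still reachable through other edges (a transitive reduction), and a «property-class» edge is dropped when the property is already attached to a superclass. This is an assumption, not the project's stated algorithm; the sample edges are illustrative, and the subclass graph is assumed to be acyclic.

```python
def reachable(edges, start, goal, skip=None):
    """Depth-first search along (subclass, superclass) edges, ignoring the edge `skip`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(sup for sub, sup in edges if sub == node and (sub, sup) != skip)
    return False


def remove_redundant_subclass(edges):
    """Drop every subclass edge whose endpoints stay connected without it."""
    kept = set(edges)
    for edge in sorted(edges):
        if reachable(kept, edge[0], edge[1], skip=edge):
            kept.discard(edge)
    return kept


def remove_redundant_property(property_edges, subclass_edges):
    """Drop (property, class) when the same property is attached to a superclass of that class."""
    return {
        (prop, cls) for prop, cls in property_edges
        if not any(prop == other_prop and other_cls != cls
                   and reachable(subclass_edges, cls, other_cls)
                   for other_prop, other_cls in property_edges)
    }
```

For example, with bus ⊂ transport ⊂ vehicle, a direct bus ⊂ vehicle edge carries no extra information and would be removed by this sketch.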
Saving results
- Saving acquired knowledge in the Knowledge.Net format
- Saving acquired knowledge in OWL
- Saving knowledge to (and loading it from) files in the project's own binary format
Current project status
- Developed a working prototype
- Created test templates
- Attached the «Mrd» dictionary (Russian and English)
Plans
- Support for creating «compound» entities (composed of several words, e.g. «creation of human hands»)
- Functionality extension (adding new entities, relationships, templates, handlers, …)
- A program for generating templates
- Developing good examples
?
Contact information:
- smlkvalex@mail.ru
- http://www.knowledge-net.ru
- http://polyhimnie.math.spbu.ru