CONCEPT MODELING: A Research Review
Đorđe Popović, Ognjen Šćekić, Veljko Milutinović
IPSI Belgrade for SUN Microsystems
January – December 2006

What Is Concept Modeling?
• A way of modeling reality:
  § Identifying concepts
  § Identifying relations among concepts
  § Organizing the concepts in a knowledge base, allowing an "intelligent" way to search and process this data.
• Why do we need concept modeling? To make electronic resources not only machine-processable, but also machine-understandable!

Challenges
• How to create a model that has a uniform structure and is powerful enough to capture the essence of any concept?
• How should these models be linked into an efficient structure?
• How can we bridge the gap between natural language and a machine-processable model?

7 Ws – PROs and CONs
[Diagram: the seven Ws: Which, When, What, Why, Where, Who, (W)How]
• WHAT associations provide general facts about any concept.
• Ultimate goal: from a specific to a general concept model!

Why Start with Patents?
• Patents are described in a very formal, structured language – claims.
• Each patent is a novel concept.
• The definition of one patent is usually based on another one.

Structure of a Patent Document
• General info about the patent
• References to related patents
• Abstract of the patent
• Claims – primary target for What
• Description – not well structured (can be used for Which, When, Why, Where, Who and How)

Conceptual Indexing (1)
• What is conceptual indexing?
  "New technique for organizing information to support subsequent access that can dramatically improve your ability to find the information you need, with less hassle and with better results." (William A. Woods)
• Conceptual indexing combines techniques of:
  § Knowledge representation
  § Natural language processing
  § Classical techniques for indexing words and phrases
• Bridges the gap between natural language and a machine-processable model.

Conceptual Indexing (2)
Conceptual indexing technology is a combination of:
• Concept extractor: identifies phrases to be indexed.
• Concept assimilator: analyzes a concept phrase to determine its place in the conceptual taxonomy.
• Conceptual retrieval system: uses the conceptual taxonomy to make connections between requested and indexed items.
Figure 1 – Main components of a conceptual indexer

Hybrid Approach: Indices + RDF/OWL
• Conceptual indices
• RDF/OWL
• Motivation: Use the advantages of one approach to eliminate the drawbacks of the other.

Conceptual Indices vs. RDF/OWL
• Conceptual indices
  § Major advantages: linear-complexity structures; provide basic subsumption relations; provide built-in knowledge on low-level concepts.
  § Major drawbacks: incapable of establishing explicit relations among high-level concepts; incapable of creating precise models.
• RDF/OWL ontologies
  § Major advantages: very expressive and precise; based on First-Order Logic; supported by W3C.
  § Major drawbacks: great complexity.

Why not Use Ontologies Alone?
• If we want to use an ontology, we have two choices:
  § Use an existing, well-established ontology that might not suit our needs.
  § Create a new ontology which does suit our needs:
    – We can create several different ontologies, depending on how we want to capture the information.
    – Problems arise when we want to merge ontologies.
• This approach works fine within a closed community with specific needs:
  § There already exists a well-defined basic ontology structure.
  § Community members have a good knowledge of how to model new concepts in terms of the existing ones.

Why not Use Indices Alone?
• For example, let us take the simplest possible definition, for a bird: bird¹ – a creature with wings and feathers that lays eggs and can usually fly.
• Our index might then contain the following associations: creature, wings, feathers, eggs, fly.
• A conceptual index does not offer the possibility
¹ Word definition taken from Longman Dictionary of Contemporary English, 3rd edition,

Hybrid Approach (1)
• An index of associations represents a simple model, similar to what humans have in mind when they first think of a bird.
• Given enough associations, one can create a model with a considerable degree of accuracy.
• RDF/OWL statements provide a means for expressing additional (but very important) information (e.g., there are birds that cannot fly!).
• We believe this is good enough for most applications.
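The bird example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HybridConcept` class, the `can_fly` helper and the `penguin` statements are hypothetical names introduced here, and plain tuples stand in for RDF/OWL statements.

```python
from dataclasses import dataclass, field


@dataclass
class HybridConcept:
    """A concept modeled as an index of associations plus explicit statements."""
    name: str
    associations: set = field(default_factory=set)  # the WHAT associations
    statements: set = field(default_factory=set)    # (subject, predicate, object) triples

    def add_statement(self, subject, predicate, obj):
        self.statements.add((subject, predicate, obj))

    def holds(self, subject, predicate, obj):
        return (subject, predicate, obj) in self.statements


def can_fly(model, individual):
    """Explicit statements override the default suggested by the index."""
    if model.holds(individual, "cannot", "fly"):
        return False
    return "fly" in model.associations


# Associations taken from the dictionary definition of "bird".
bird = HybridConcept("bird", {"creature", "wings", "feathers", "eggs", "fly"})
# Hypothetical additional statements playing the role of RDF/OWL triples.
bird.add_statement("penguin", "is_a", "bird")
bird.add_statement("penguin", "cannot", "fly")
```

The index alone answers "can usually fly"; the explicit statement captures the exception.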

Hybrid Approach (2)
• It is important to keep track of how many times a term is mentioned, because it affects its descriptive power.
  § Example: U.S. Patent 6,989,179 – "Synthetic grass sport surfaces", claims section:
    1. synthetic grass [9]
    2. playing surface [10]
    …
• These terms represent the essence of what is being

Hybrid Approach (3)
• However, this is only because we know what "synthetic grass" and "playing surface" are! At some level, we need to have some intrinsic, built-in knowledge base of basic concepts!
• All the other concepts can then be described in terms of these basic concepts.
• Solution: Conceptual indexers are equipped with a knowledge base of basic terms.

Patent Model – Conceptual Index
• A patent's Claims section is scanned and processed by a conceptual indexer.
• The result is a descriptive index associated with the patent (its size is approx. 1–5% of the full text).
• This index can be seen as an ordered list of the patent's WHAT associations (terms, phrases, sentence fragments).
• An entry in the descriptive index contains a low-level concept and the number of its occurrences.
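A descriptive index of this shape (concept plus number of occurrences, ordered by count) can be approximated as follows. This is a minimal sketch, not a conceptual indexer: it assumes the list of known terms is given up front, and the sample claims text is invented for illustration rather than quoted from any patent.

```python
import re
from collections import Counter


def descriptive_index(claims_text, known_terms):
    """Return (term, occurrences) pairs, most frequent first."""
    text = claims_text.lower()
    counts = Counter()
    for term in known_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        occurrences = len(re.findall(pattern, text))
        if occurrences:
            counts[term] = occurrences
    return counts.most_common()


# Invented sample text, used only to exercise the function.
sample_claims = (
    "A synthetic grass playing surface comprising synthetic grass panels. "
    "The playing surface of claim 1, wherein the synthetic grass is fibrillated."
)
index = descriptive_index(sample_claims, ["synthetic grass", "playing surface", "ribbon"])
```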

Patent Model – RDF/OWL
• For a different application, a different RDF/OWL model needs to be devised.
• For describing patents, this model could be used to capture explicitly stated information:
  § Patent number and other numbers (WHICH)
  § Inventor, examiner, attorney, … (WHO)
  § Date when the patent was filed (WHEN)
  § Explicit references to similar patents (WHICH)
  § etc.
• Each W can have multiple sub-categories that are application-specific!

Patent Model – Creation
Figure 2 – Creation of a patent model: The Claims section is processed by the conceptual indexer to produce an index associated with the patent. Additional information about the concept is captured by RDF/OWL statements, in a predefined, application-specific structure.

Patent Model – Result
Figure 3 – Patent model: WHAT associations are contained in a descriptive index. Other Ws are expressed through RDF/OWL statements.

Patent Model – Big Picture
• Descriptive indices are re-processed by the conceptual indexer to form the system index.
• Each entry in the system index retains links to the descriptive indices it originates from, and vice versa.
• This structure allows us to:
  § Perform quick searches of the existing patents
  § Add/remove patents easily
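One way to picture the system index is as a two-way mapping between terms and patents. The sketch below is an assumption about the data structure, not the actual implementation: the `SystemIndex` class and its method names are hypothetical, and the patent identifiers in the example are placeholders.

```python
from collections import defaultdict


class SystemIndex:
    """Two-way links between system-index terms and per-patent descriptive indices."""

    def __init__(self):
        self.term_to_patents = defaultdict(set)  # system index entry -> patents
        self.patent_to_terms = {}                # patent -> its descriptive index

    def add_patent(self, patent_id, descriptive_index):
        """descriptive_index maps each term to its number of occurrences."""
        self.patent_to_terms[patent_id] = dict(descriptive_index)
        for term in descriptive_index:
            self.term_to_patents[term].add(patent_id)

    def remove_patent(self, patent_id):
        for term in self.patent_to_terms.pop(patent_id, {}):
            self.term_to_patents[term].discard(patent_id)
            if not self.term_to_patents[term]:
                del self.term_to_patents[term]

    def search(self, term):
        return set(self.term_to_patents.get(term, set()))


system = SystemIndex()
system.add_patent("patent-A", {"synthetic grass": 9, "playing surface": 10})
system.add_patent("patent-B", {"playing surface": 3, "drainage layer": 2})
```

Because each patent keeps its own descriptive index, adding or removing a patent only touches the entries for its own terms.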

Figure 4 – Top-level scheme

Patent Model – Patent Relations
Two ways of establishing relations among patents:
• Via RDF/OWL statements, using automated reasoners
  § Problem: Referential integrity & consistency
• Via the system index (implicit links)
  § Problem: Inexact, based on probability

Patent Model – Implicit Links (1)
• Descriptions of similar concepts (patents) usually make frequent use of similar or even identical terms.
• By determining overlapping terms, we create dynamic, implicit links among similar concepts.
• The number of such implicit links can be used to express similarity among concepts.
• The algorithm for determining the similarity

Patent Model – Implicit Links (2)
• For example: When describing two different vaccines, we would probably make frequent use of terms like vaccine, inactivated antigens, immune response, etc.
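The slides do not spell out the similarity algorithm, so the following is only one plausible choice, stated as an assumption: implicit links are the overlapping terms of two descriptive indices, and similarity is a weighted Jaccard ratio over occurrence counts. The two vaccine indices and their counts are invented.

```python
def implicit_links(index_a, index_b):
    """Terms shared by two descriptive indices (dicts of term -> occurrences)."""
    return set(index_a) & set(index_b)


def similarity(index_a, index_b):
    """Weighted Jaccard similarity in [0, 1]; an assumed measure, not the authors'."""
    all_terms = set(index_a) | set(index_b)
    shared = sum(min(index_a.get(t, 0), index_b.get(t, 0)) for t in all_terms)
    total = sum(max(index_a.get(t, 0), index_b.get(t, 0)) for t in all_terms)
    return shared / total if total else 0.0


# Invented descriptive indices for two vaccine descriptions.
vaccine_a = {"vaccine": 5, "inactivated antigens": 2, "immune response": 3}
vaccine_b = {"vaccine": 4, "immune response": 1, "adjuvant": 2}

links = implicit_links(vaccine_a, vaccine_b)
score = similarity(vaccine_a, vaccine_b)
```

The result is inexact by design, which matches the drawback noted for links established via the system index.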

Advantages & Drawbacks
• Advantages
  § Reduced complexity (a great reduction of direct links between concepts)
  § Fast search and retrieval
  § Scalability (as a result of using indices)
• Drawbacks
  § Use of indices implies loss of precision

Conclusion
• Our idea is still in the first stage of development.
• Its key advantages are its general applicability and reduced complexity.
• Further research is needed to explore the quality and feasibility of the proposed solution.
• However, we expect that the combination of OWL/RDF structures and indices might produce a satisfactory performance/exactness ratio.

References
§ W. A. Woods, L. A. Bookman, A. Houston, R. J. Kuhns, P. Martin, S. Green, "Linguistic Knowledge Can Improve Information Retrieval", Proc. of the Applied Natural Language Processing Conference (ANLP 2000), Seattle, 2000.
§ O. Scekic, P. Bojic, "An Overview of OWL and its Role in Semantic Web Architecture", YU-INFO 06, Kopaonik, Serbia & Montenegro, 2006.
§ B. V. Dobrov, N. V. Loukachevitch, T. N. Yudina, "Conceptual Indexing Using Thematic Representation of Texts", Scientific Research Computer Center of Moscow State University, Moscow, 1998.
§ S. Omerovic, D. Savic, S. Tomazic, "A Survey of Concept Modeling", Faculty of Electrical Engineering, University of Ljubljana, Slovenia (to appear).
§ W. A. Woods, "Conceptual Indexing: A Better Way to Organize Knowledge", Technical report, Sun Microsystems Laboratories, 1998.

CONCEPT MODELING: Revisited with Details
A Proposed Hybrid Approach to Patent Modeling
Đorđe Popović (popajce@ptt.yu), Veljko Milutinović (vm@etf.bg.ac.yu), Ognjen Šćekić (ogi@cg.yu)

Initial Assignment
• January 2006, initial assignment: Get acquainted with different ways of Concept Modeling, in general.
• More specifically, explore the possibilities offered by RDF and OWL.
• One of the ideas: Use the 7 Ws -

New Assignment
• May 2006, specific assignment:
  § Find ways of extracting prior art from previously filed patents.
  § Use the results to determine novel art in the descriptions of patents that have yet to be filed.
  § Generate new claims from newly found novel art, to be submitted for new patents.

Determining Prior and Novel Art (1)
• This work is currently done by experts.
• It requires deep knowledge of the subject and much time spent searching various databases of existing patents.
• It is both time-consuming and expensive!

Determining Prior and Novel Art (2)
• Existing tools use statistical, data-mining techniques.
  § Very efficient and fast algorithms are available for extracting relevant keyphrases.
  § But they have limited ability to establish anything other than basic, usually undefined, relations among concepts.
  § Problem: How to determine more complex relations among concepts in order to create claims (sentences)?
• Solution: Additional Natural Language Processing (NLP) techniques are required!

Proposed Solution – Stage 1
• Three stages:
  1. Statistical analysis & seed extraction
  2. Construction of Claims table
  3. Creation of claims
• Statistical analysis & seed extraction:
  § Process the text with a statistical analysis tool (in our case, KEA 3.0).
  § The output of such tools is an index of relevant words/phrases – keywords, each associated with a score.
  § Ideally, by using a conceptual indexer the output would be a much more expressive "conceptual index".
  § Composite keywords are turned into a single keyword and its descriptors.
  § Use empirical rules on word scores and composite phrases to determine the most relevant keywords, and declare them to be the seeds for further analysis.

Proposed Solution – Stage 1
• Tools such as KEA require initial training and tweaking to achieve maximum performance.
• We trained KEA on a set of 12 relevant Sun patents.
• All extracted seeds are kept in a database, so that they are available later when needed.
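Seed selection can be sketched as follows. The code does not call KEA; it assumes the tool's output is already available as (phrase, score) pairs. The scores, the `min_score` threshold and the "last word is the head keyword" rule are all illustrative assumptions standing in for the empirical rules mentioned above.

```python
def split_composite(phrase):
    """Turn a composite keyword into (keyword, descriptors).

    Assumption: the last word is the keyword and the preceding words describe it.
    """
    words = phrase.split()
    return words[-1], words[:-1]


def select_seeds(scored_phrases, min_score=0.5):
    """Keep phrases scoring at least min_score and merge them by head keyword."""
    seeds = {}
    for phrase, score in scored_phrases:
        if score < min_score:
            continue
        head, descriptors = split_composite(phrase)
        entry = seeds.setdefault(head, {"score": 0.0, "descriptors": set()})
        entry["score"] = max(entry["score"], score)
        entry["descriptors"].update(descriptors)
    return seeds


# Invented scores, standing in for the output of a keyphrase extraction tool.
scored = [
    ("synthetic grass", 0.9),
    ("playing surface", 0.8),
    ("support surface", 0.6),
    ("hook", 0.2),
]
seeds = select_seeds(scored)
```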

Proposed Solution – Stage 2
• Construction of Claims table:
  § The text is processed once more to eliminate the sentences that do not contain any of the seeds.
  § Each seed is assigned an entry in the claims table, and its occurrences in the text are marked with a unique marker.
  § The text is then analyzed sentence by sentence.
  § Each sentence is decomposed into its functional parts – subject fragments, object fragments, predicate fragments and different adverbial fragments. (NLP – the hardest part!)
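The first two steps (dropping sentences without seeds and marking seed occurrences) can be sketched as below. The `[n]` marker format mirrors the claims table figure, but the sentence splitter is a naive assumption and the sample text is invented. Decomposition into functional parts is left to the NLP tools.

```python
import re


def mark_seeds(text, seeds):
    """Return the seed markers and the marked sentences that contain a seed."""
    markers = {seed: f"[{i}]" for i, seed in enumerate(seeds)}
    # Naive sentence splitting on terminal punctuation; real patent text needs more care.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        marked, found = sentence, False
        for seed, marker in markers.items():
            pattern = re.compile(r"\b" + re.escape(seed) + r"\b", re.IGNORECASE)
            marked, hits = pattern.subn(lambda m, mk=marker: m.group(0) + mk, marked)
            found = found or hits > 0
        if found:
            kept.append(marked)
    return markers, kept


# Invented sample text.
sample = (
    "Synthetic grass panels are placed side by side. "
    "The weather was fine. "
    "The panels are formed of grass sections."
)
markers, kept = mark_seeds(sample, ["grass", "panels"])
```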

[0] Grass
    TYPE: synthetic (WHAT)
[1] Surface(s)
    TYPE: [0], support, playing (WHAT)
    are manufactured from s.g. panels
[2] Panel(s)
    TYPE: [0]
    are placed side-by-side to form continuous support surface [1]
    form continuous support surface
    are formed of grass sections [3]
    are square OR rectangular
    have different color tones
[3] Section(s)
    TYPE: [0]
    are cut from grass panels [from 2]
    are sewn OR glued OR attached together by a hook and loop attachment
    in a criss-crossed way
    to create a checkered pattern
    create checkered pattern
    are assembled with ribbons OR fibers lying in different directions
[4] Ribbon(s)
    TYPE: [2]
    lie in different directions
    are fibrillated to remove the grain directions
etc…
Row tags (right-hand column of the figure): (predicate) (WHAT) (predicate) (WHY) (predicate) (WHAT) (predicate) (HOW) (WHY) (predicate) (HOW) (WHAT) (predicate) (WHY)
Figure 5 – U.S. Patent 6,989,179 – "Synthetic grass sport surfaces", Claims table (part of)

Proposed Solution – Stage 3
• Creating claims once the table is complete is straightforward.
• Here are some of the claims created from the previously shown table:
  § A synthetic grass surface manufactured from synthetic grass panels.
  § A synthetic grass playing surface as defined in claim 1, wherein said synthetic grass panels are placed side by side to form a continuous support surface.
  § A synthetic grass playing surface as defined in claim 2, wherein said synthetic grass panels are formed of synthetic grass sections.
• Generated claims are compared against prior-art
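Claim generation from table rows can be illustrated with simple string templates. The chaining rule used here (each claim depends on the previous one) and the `generate_claims` function are assumptions made for the sketch; the statements are taken from the example claims above.

```python
def generate_claims(subject, statements):
    """Build numbered claims; claim n depends on claim n-1 (an assumed chaining rule)."""
    claims = []
    for number, statement in enumerate(statements, start=1):
        if number == 1:
            claims.append(f"{number}. A {subject} {statement}.")
        else:
            claims.append(
                f"{number}. A {subject} as defined in claim {number - 1}, "
                f"wherein {statement}."
            )
    return claims


claims = generate_claims(
    "synthetic grass playing surface",
    [
        "manufactured from synthetic grass panels",
        "said synthetic grass panels are placed side by side "
        "to form a continuous support surface",
        "said synthetic grass panels are formed of synthetic grass sections",
    ],
)
```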

Problems
Major obstacles that needed to be overcome were:
• How to determine prior art:
  § Concept classifier
  § Sentence Template Tool (NLP)
• How to determine functional parts of a sentence:
  § Sentence Analyzer (NLP)

Figure 6 – Top-level scheme
• The patent description is processed by KEA and the Sentence Template Tool to extract relevant keywords (seeds).
• Seeds are compared against prior art contained in the database.
• NLP processing: the Claims table is created by analyzing sentences containing seeds.
• New claims are generated from the table.

Implementation of NLP Parts
• A subgroup of the research team began working on the NLP tools.
• After extensive research we adopted the Stanford parser as the base tool for our work (http://nlp.stanford.edu).
• The parser analyzes single sentences. Its output is a tree structure showing types of words and sentence fragments.
• It can also determine basic grammar relations.
• Our plan: Use the first output to create the template tool, and both outputs to determine functional parts of a sentence.

Stanford Parser – An Example
"One implementation of the snapshot copy process provides a two-table approach."

    (ROOT
      (S
        (NP
          (NP (CD One) (NN implementation))
          (PP (IN of)
            (NP (DT the) (NN snapshot) (NN copy) (NN process))))
        (VP (VBZ provides)
          (NP (DT a) (JJ two-table) (NN approach)))
        (. .)))

    num(implementation-2, One-1)
    nsubj(provides-8, implementation-2)
    det(process-7, the-4)
    nn(process-7, snapshot-5)
    nn(process-7, copy-6)
    prep_of(implementation-2, process-7)
    det(approach-11, a-9)
    amod(approach-11, two-table-10)
    dobj(provides-8, approach-11)

Grammar relations can be used to determine the main functional parts of sentences.
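To show how grammar relations lead to functional parts, the sketch below reads relation lines in the textual form shown above and picks out subject, predicate and object. It does not invoke the Stanford parser; the regular expression and the `functional_parts` helper are assumptions that only cover the `nsubj` and `dobj` relations.

```python
import re

# Matches lines such as "nsubj(provides-8, implementation-2)".
RELATION = re.compile(r"(\w+)\(([^,]+)-(\d+), ([^)]+)-(\d+)\)")

RELATIONS = [
    "num(implementation-2, One-1)",
    "nsubj(provides-8, implementation-2)",
    "det(process-7, the-4)",
    "nn(process-7, snapshot-5)",
    "nn(process-7, copy-6)",
    "prep_of(implementation-2, process-7)",
    "det(approach-11, a-9)",
    "amod(approach-11, two-table-10)",
    "dobj(provides-8, approach-11)",
]


def parse_relations(lines):
    """Return (relation, governor, governor_pos, dependent, dependent_pos) tuples."""
    parsed = []
    for line in lines:
        match = RELATION.fullmatch(line.strip())
        if match:
            rel, gov, gov_pos, dep, dep_pos = match.groups()
            parsed.append((rel, gov, int(gov_pos), dep, int(dep_pos)))
    return parsed


def functional_parts(relations):
    """Pick the head words of subject, predicate and object."""
    parts = {}
    for rel, governor, _, dependent, _ in relations:
        if rel == "nsubj":
            parts["predicate"], parts["subject"] = governor, dependent
        elif rel == "dobj":
            parts["predicate"], parts["object"] = governor, dependent
    return parts


parts = functional_parts(parse_relations(RELATIONS))
```

Only head words are returned; expanding them into full fragments would mean following the remaining relations (`nn`, `amod`, `prep_of`).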

Sentence Template Tool (1)
• Motivation: In a single patent document, authors often use the same sentence templates for describing various parts of the patent.
• This tool allows users to specify the sentence templates to find, and the parts they want extracted.

Sentence Template Tool (2)
• Example from US patent No. 6,804,755:
  FIG. 1 is a pictorial representation of a distributed data processing system in which the present invention may be implemented;
  FIG. 2 is a block diagram of a storage subsystem in accordance with a preferred embodiment of the present invention;
  ...
  FIG. 10 is an exemplary block diagram of a multi-layer mapping table in accordance with a preferred embodiment of the present invention;
  FIG. 11 is an exemplary illustration of FlexRAID in accordance with the preferred embodiment of the present invention;
  ... etc.
• There are more than 20 sentences of the same structure in this patent description!

Sentence Template Tool (3)
• This sentence structure is typical of many patent descriptions, where the inventor describes what the figures represent.
• Figure description sentences may contain important novel concepts.
• Novel concepts from already filed patents can be treated as prior art for the analyses of future patents.

Sentence Template Tool (4)
• For example: "FIG. 10 is an exemplary block diagram of a multi-layer mapping table in accordance with a preferred embodiment of the present invention."
• The query that would return the underlined sentence part might look like this:
  "Fig" * "is" * <NounPhrase><Preposition><?:NounPhrase>*<.>
• We developed a comprehensive query syntax for comparing parsed sentence trees, similar to the one shown here.
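As a rough plain-text analogue of such a template, the following sketch uses an ordinary regular expression instead of the tool's query syntax over parse trees. Which part the original query returns is not recoverable from the slide, so capturing the noun phrase after "of" is an assumption, as are the pattern and the `extract_figure_subjects` name.

```python
import re

# "FIG. <n> is a/an <kind> of <subject> in accordance with|in which ..."
FIGURE_TEMPLATE = re.compile(
    r"FIG\.\s*(?P<number>\d+)\s+is\s+an?\s+(?P<kind>.+?)\s+of\s+"
    r"(?P<subject>.+?)\s+in\s+(?:accordance with|which)\b"
)


def extract_figure_subjects(text):
    """Return (figure number, kind of drawing, described subject) triples."""
    return [
        (int(m.group("number")), m.group("kind"), m.group("subject"))
        for m in FIGURE_TEMPLATE.finditer(text)
    ]


description = (
    "FIG. 1 is a pictorial representation of a distributed data processing system "
    "in which the present invention may be implemented; "
    "FIG. 10 is an exemplary block diagram of a multi-layer mapping table "
    "in accordance with a preferred embodiment of the present invention; "
    "FIG. 11 is an exemplary illustration of FlexRAID "
    "in accordance with the preferred embodiment of the present invention;"
)
figures = extract_figure_subjects(description)
```

A regular expression works on surface text only; a query over parsed sentence trees can match grammatical categories, which plain patterns cannot express.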

Advantages
• Frequently used queries can be stored for later use.
• If this tool is to be used primarily within a company, employees can be given guidelines on how to describe certain parts of a patent, making the tool easier and more efficient to use.
• The key advantage of this approach is that it is much more accurate than statistical tools, because it is controlled by humans.

Future Plans
• Use the results returned by Google, refine them by applying semantic analysis, and give immediate answers to user queries!
• Users should be able to use the query syntax not merely to specify keywords, but also to require that terms appear in a specified context, or to ask specific questions.

Future Plans (1)
• This kind of analysis requires an enormous amount of CPU time, and should therefore be performed only for specific searches:
  § Patents
  § Legal acts and documents
  § Newspaper and other archives
  § Deep internet search
  § etc.

Future Plans (2)
• Possible solution: Each document should contain an additional metadata section holding the parsed form of its plain text.
• That way, documents that rarely change need to be processed only once.
• Additional storage costs should be outweighed by the increased search performance.
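The parse-once idea can be sketched as a document object that stores its parsed data next to a digest of the text it was computed from. Everything here is hypothetical: the `ParsedDocument` class, the metadata layout and the stand-in parser are assumptions used to illustrate the caching behaviour.

```python
import hashlib


class ParsedDocument:
    """Keeps parsed data as metadata and re-parses only when the text changes."""

    def __init__(self, text, parser):
        self.text = text
        self._parser = parser
        self.metadata = {}  # stands in for the document's metadata section

    def parsed(self):
        digest = hashlib.sha256(self.text.encode("utf-8")).hexdigest()
        if self.metadata.get("digest") != digest:
            # Text is new or has changed: run the expensive analysis once.
            self.metadata = {"digest": digest, "parsed": self._parser(self.text)}
        return self.metadata["parsed"]


calls = []


def toy_parser(text):
    """Stand-in for an expensive NLP parse; records each invocation."""
    calls.append(text)
    return [s.strip() for s in text.split(".") if s.strip()]


doc = ParsedDocument("Panels are placed side by side. Ribbons are fibrillated.", toy_parser)
first = doc.parsed()
second = doc.parsed()  # served from the stored metadata, no second parse
```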

Future Plans (3)
• Our idea is still in the first stage of development.
• Further research is needed to explore the quality and feasibility of the proposed solution.
• However, we expect to produce some interesting results.

Thank you!