UNIVERSITY OF NAIROBI SCHOOL OF COMPUTING AND INFORMATICS

Natural Language Access to Relational Databases: An Ontology Concept Mapping (OCM) Approach
Lawrence Muchemi Githiari, P80/80034/2008
Supervisor: Dr. Wanjiku Ng’ang’a
Viva for the Degree of Doctor of Philosophy in Computer Science
© 2014

A Social Need for NL Querying of RDBs by Casual Users
[Diagram: NLQ processing solution that is language independent, domain independent, generic across RDBs, and cross-lingual]
12/5/2020 PhD Presentation by Lawrence Muchemi

Focus of Research
• Relational database access using NL
• It is an active basic research issue under HLT.
• 'An Information Access Problem' in NLP.
• Specific challenges tackled:
  – Domain independence
  – Language independence
  – Cross-linguality
  – Robustness (of inputs)
[Callout, garbled in extraction; legible fragments mention previous studies, Prolog DBs, semantic ontologies and GATE (Sheffield Univ.)]

Problem Statement
The unresolved issue of NLQ processing for RDB access is the main research problem addressed.
1. Main challenge: lack of a language- and domain-independent (generic) approach that maps any given NL to a structured query language.
2. Most DBs have to grapple with cross-lingual interaction; this problem is also addressed.
3. Other challenges addressed include the lack of generic models that:
   • Define the processes of concept discovery from NLQ free text
   • Guide the retrieval of concepts from RDB metadata
   • Provide mapping algorithms, structured-query generator functions and other heuristics

Objectives
The main objective:
• Bring forth a novel approach and an architecture devoid of language and domain dependence
• Also address other challenges within the state of the art (noted in the problem statement)
Specific objectives:
1. Develop a suitable language- and domain-independent methodology for understanding unrestrained NL text.
2. Design an architectural model, and algorithms thereof, that facilitate access to data in DBs, using English and Kiswahili as case-study languages.
   1. Design algorithms for parsing free NL text and a data structure for holding the parsed queries.
   2. Design algorithms for extracting concepts from ontologies.
   3. Design a matching function; design SPARQL generator functions.
3. Develop a prototype upon which performance evaluations can be done.

Significance of Research
• The solution is an important intermediate step in speech processing for RDB access
• A gold-standard model evaluation framework
• A generic design, devoid of language and domain considerations
• More direct interaction with casual users: "There is a renewed interest in catering for casual user NL text interaction" (Kauffman & Bernstein, 2007)
• Beneficiaries: researchers, developers, users
• Contribution to the NLDBA field: a novel language- and domain-independent approach upon which NL interfaces to DBs can be built

REVIEW OF VARIOUS SCHOOLS OF THOUGHT ON STRUCTURED-QUERY GENERATION
1. Semantic Parsing
2. Logic Mapping
3. Ontology-based Approaches to Related Problems

I. Semantic Parsing
• Two back-to-back classifiers: approaches to MR generation, followed by approaches to SQL generation
• Approaches: probabilistic, machine learning, statistical
• Commonly cited grammars for MR:
  – Definite Clause Grammar (DCG)
  – Combinatory Categorial Grammar (CCG), example: atieno → NP; loves → (S\NP)/NP : λx.λy.loves(y, x); kamau → NP
  – Synchronous Context Free Grammar (Sync. CFG) (Chiang, 2006)
• The grammars above may be augmented with the lambda (λ) notation

II. Logic Mapping Approaches
• Logic mappers
• Token matching
  – Tagged tokens, e.g. Dittenbach & Berger (2003) and Popescu et al. (2003)
  – Finite state transduction (Garcia et al., 2008)
• Phrase-trees mapping
  – Interlingua (Shin & Chu, 1998)
  – Phrase-trees mapping using templates (TTM) (Muchemi, 2008)
• Syntactic trees mapping
  – NL/SQL syntactic trees mapping with SVM-based learners (Giordani & Moschitti, 2010)

III. Related Ontology-based Approaches
• No reported literature on relational database access
• Access to other repositories (e.g. semantic web)
• Relies on bag-of-words with direct token mappings to an ontology, e.g. Querix (Kauffman et al., 2006) and QuestIO (Tablan et al., 2008)

Notable Issues with Above Approaches
Machine learning:
• Challenges of portability (moving across domains): "… parser has to be trained on a corpus of questions specific to a db…. making portability a big issue" (Popescu et al., 2003)
• Cost of building the corpus: "…weakness of the approach is the cost of training corpus of natural language/ logical expression pairs" (Minock et al., 2008)
• Low accuracy due to the back-to-back arrangement of classifiers
Mapping techniques:
• Errors from automatic tagging (the same as in semantic labeling, such as poor classification), while manual tagging is costly, resulting in relatively poor performance
Ontology-based techniques:
• Generic RDB access is not well studied, hence the purpose of this research work
Cross-cutting:
• Language and domain dependence

Trends in Approach to NL Access Problem
• LR: correlation between the degree of structuredness of the repository and the preferred approach
[Figure: preferred approach over time by repository type, e.g. semantic web data, named entity entries, plain texts & HTML text]
• From the above trends, it is deduced that a generic NLDBA model should be sought in the area of ontology concept mapping.
• The power of ontologies lies in their capacity to provide context for semantics within the Resource Description Framework (RDF).

The OCM Conceptual Model
• From analysis of the literature, the following OCM conceptual model was designed.
[Figure: graphic image of the conceptual model]

Issues that Needed to be Tackled Before Realization of the OCM Model
1. 'Concepts' modeling and discovery methods
   • 'Explicit' and 'implicit' concepts discovery models
   • Decoding of schema data (no controlled vocabulary)
   • Design of language- and domain-independent 'Ontology Concepts Deducing Algorithms'

Issues that Needed to be Tackled Before Realization of the OCM Model
• Design of schemata
  1. Feature Space Model (FSM)
  2. Gazetteer Model
• Cross-lingual access
• Mapping algorithm
  – Need to enhance the current state of the art, the Lexical-Level, Keyword-based Matching method (LLKM) (Punyakanok, Roth, & Yin, 2004)
• Structured query generator
  – The query generator's task is to organize 'concepts' into a structured query

Research Design: Double Waterfall Strategy
RD: an overall strategy that integrates the different components of the study in a coherent and logical way, thereby addressing the overall research problem.
NLQ concepts study:
• Modeling of NLQ concepts: modeling the query semantics transfer process (NLQ → DSF → SPARQL)
• Design of NLP concepts discovery process components: Feature Space Modeling (FSM)
DB concepts study:
• Modeling of RDB concepts: deciphering meanings from schema data
• Design of ontology processing components:
  1. Modelling 'concepts re-construction'
  2. Schema design (Gazetteer)
Common stages:
• Design of common processing components:
  1. Concepts mapping algorithm
  2. A structured query-generator function
• Architectural design:
  1. Assembly of components to form the OCM-based architectural model
  2. Design of main algorithms and heuristics
• Development of prototype
• Evaluation and benchmarking

Research Methodology
Stages: NLQ Concepts Modeling; NLP Components Design; Concepts Modeling (RDB); Ontology Processing Components Design; Join Processing Components Design; Architecture Design; Prototype Development; Evaluation & Benchmarking
METHOD (stage 1): several case studies, 5 in number: two with primary data, three with secondary data
• 5-point case study research design strategy (Yin, 1994) used
• Implementation protocol adopted: the one developed at MIT (Zucker, 2009)
• Sampling: stratified random sampling approach
• Kernelization technique used in query decomposition
The 5-point case study design:
1. Research questions
2. Make propositions
3. Establish analysis rigor
4. Linking data to proposition
5. Criteria for interpretation

Research Methodology
• NLP Components Design (METHOD 2): simulations in test bed; validation of the Query Semantics Transfer Model, quantitative and qualitative
• RDB Concepts Modeling (METHOD 3): case studies; data collected from questionnaires and internet-based surveys
• Ontology Processing Components Design (METHOD 4): simulations in test bed; experimental evaluation of the efficacy of the algorithm on 6 test databases
• Join Processing Components Design (METHOD 5): simulations in test bed; quantitative evaluation of the algorithm, function and other heuristics through experiments; sampled queries used
• Architecture Design, Prototype Development, Evaluation & Benchmarking: discussed in the evaluation segment

Research Questions Guiding NLQ Studies
• Generative-Transformation (GT) Theory: all languages have the same Deep Structure Form (DSF) for similar sentences, but their respective Surface Structure Forms (SSF) differ because of the application of different transformation rules. Noam Chomsky (1957)
• DSF = simple, assertive, declarative and active
• This study is based on GT theory but concentrates on query semantics transfer in NL queries (as opposed to sentence transformations).
1. Can the deep structure form (DSF) of queries be used in deducing the interrogative properties of NL queries?
2. What type of relationships exist between DSF and SPARQL queries, and are they language and domain independent?
3. Are the processes for conversion of SSF to DSF in NL queries language and domain independent?

Cases 1 & 2: Kiswahili & English Queries
Kiswahili queries case study
• Pre-study survey: a group of farmers. Are they potential users of an NLQ-DB access system?
  – Respondents were regular users of veterinary services on a commercial scale
• Case study: solicited potential queries (for use in a simulated DB)
  – 25 information request areas per questionnaire
  – Purposive sampling method
  – Sample sizes were determined on the basis of 'theoretical saturation'
  – 625 questions were collected
English queries case study
• Queries contained in:
  – A web interface maintained by the UoN MSc coordinator
  – Various e-mails
• Data collected was from the domain of students' management (provided a domain variation with case 1)
• Queries collected were in English (provided a language contrast to case 1)
• 310 questions were collected

Other Query Sets Used
No. | Name of query set | No. of questions | Description | Original source
1 | CASE 1: Farmers Queries | 625 | Poultry farmers' queries, Kiswahili | Muchemi (2008)
2 | CASE 2: UoN MSc Coordinator | 310 | Questions by UoN MSc students to the coordinator, English | Coordinator e-mails
3 | CASE 3: ELF Queries to MS Northwind DB | 120 | Originally created by Bootra to evaluate ELF on the Microsoft Northwind DB (at Virginia Commonwealth University), English | Bootra (2004)
4 | CASE 4: Computer Jobs | 500 | Database and queries for computer jobs, used originally by Tang under Ray Mooney for PhD work at Texas State University, English | Tang & Mooney (2001)
5 | CASE 5: Restaurant | 250 | Same as above but for restaurant selection, English | Tang & Mooney (2001)
Total | | 1805 | |
Side label: benchmarking data

Findings 1: Prevalence of Transformation Rules
Example of a formal transformation rule:
  'Was a lot of water taken by the chicken?' ↔ 'Chicken took a lot of water'
  Aux – NP2 – V – NP1 ↔ NP1 – V – NP2
1. DAT = Agent Deletion, e.g. '[Kuku] inataga kwa mda?' ↔ 'inataga kwa mda gani?'
2. PT = Passive (from active to passive, deep to surface), e.g. 'Je, vifaranga walikula chakula?' ↔ 'Je, chakula kililiwa na vifaranga?'
3. DET = Deletion of Elements (eliminates excessive words), e.g. 'Jimbi na vifaranga walikula chakula?' ↔ 'Jimbi walikula chakula na vifaranga walikula chakula?'
4. IT = Imperative Transformation (command), e.g. 'nipe wanunuzi bora' ↔ 'Wanunuzi bora'
5. CT = Coordination (two sentences are combined into one at the surface), e.g. 'Kuku walikula chakula kimeoza' / 'Kuku walihara' ↔ 'Kuku walikula chakula kimeoza kisha wakahara'
6. AET = Addition of Elements (adds information such as ADJ & ADV), e.g. 'kuku amehara' ↔ 'kuku mweupe amehara'
7. NT = Negation Transformation, e.g. 'kuku amehara' ↔ 'kuku hajahara'
Conclusion: these are the 7 most prevalent generative transformation rules.

Findings 2: Mapping NLQ Semantics to DSF Semantics
Sampling: stratified random sampling approach
• Each population (query set) was divided into 12 strata, based on query type (e.g. 'who', 'when', 'what').
  – The size of each stratum was determined by the frequency of each query type.
  – 50 query samples were obtained from each of the 5 populations; total 250 queries.
  – The questions for each query type (forming the strata) were randomly selected from the original population.
• Semantics transfer analysis
  – 7-step 'kernelization' method for identification of Meaning Bearing Components (MBCs)
Conclusion: S-V-O terms and modifiers (adj, adv etc.) are critical components in the transfer process.
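The proportional allocation described on this slide can be sketched in Python. This is a minimal illustration, not the study's sampling script: the function name `stratified_sample`, the `(query_type, text)` tuple layout, the fixed seed and the toy population are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(queries, sample_size, seed=0):
    """Draw a proportional stratified random sample.

    queries: list of (query_type, text) pairs; query_type is the stratum.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for query_type, text in queries:
        strata[query_type].append(text)

    total = len(queries)
    sample = []
    for query_type, items in strata.items():
        # Stratum size follows the frequency of the query type.
        k = round(sample_size * len(items) / total)
        k = min(max(k, 1), len(items))
        sample.extend((query_type, t) for t in rng.sample(items, k))
    return sample


# Hypothetical population: 60 'what', 30 'who' and 10 'when' queries.
population = (
    [("what", f"what-{i}") for i in range(60)]
    + [("who", f"who-{i}") for i in range(30)]
    + [("when", f"when-{i}") for i in range(10)]
)
sample = stratified_sample(population, sample_size=10)
counts = {t: sum(1 for q, _ in sample if q == t) for t in ("what", "who", "when")}
```

Because each stratum size is rounded independently, the total can differ slightly from `sample_size` when the proportions are not whole numbers; a real script would need a rule for distributing the remainder.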

Finding 3: Is there a regular process by which the semantics of a query are transferred from the SSF to the MBCs?
• YES: modeled as the Query Semantics Transfer Model (QuSeT Model)
• Modeling was done after semantic analysis of data stratified across all categories of transformation rules
[Figure: the Query Semantics Transfer Model (QuSeT Model)]

Findings 4 to 7: Analysis & Validation of QuSeT
4. What relationships exist between Meaning Bearing Components (MBCs)?
   • MBCs have a tri-partite relation: between subject, verb and object; OR any 2 of these components and an interrogative (or a modifier of either); OR any of S, V, O and its modifiers
   • A variable can replace any element within the tri-partite relation
5. Deviations in QuSeT: does the transfer process occur without deviation from QuSeT?
   • 1st and 2nd persons in a query do not bear direct semantic reference and can be dropped
6. Qualitative validation: does query semantics transfer conform to the QuSeT model?
   • All 12 query types conformed
   • A WH-query is answered through substitution of the interrogative with a suitable MBC
7. Quantitative validation of QuSeT (model built as a Python module)
   • Swahili: 23 of 25 NLQs analyzed correctly
   • English: 24 of 25 NLQs analyzed correctly
   • Mean accuracy of the QuSeT model was therefore determined as 94%

Finding 8: What type of relationship exists between DSF and SPARQL?
• MBCs have a tri-partite relation, as observed from QuSeT.
• Example query: 'What is the phone number of the customer whose ID is 1'
• There are 3 MBCs (phone number; customer; ID). 'What' is an interrogative.
• Only 2 triples (with the 1st element being a DB name) are possible.
• They have the format: ?element1 ?element2 ?element3
• There is a mention of a specific row value (instance ID = 1), hence an additional FILTER clause:
    ?customer ?phone_number ?Variable1 (what?)
    ?customer ?id_number ?Variable2 (what?)
    FILTER (?id_number = "1")
• Conclusion: MBC triples map directly onto SPARQL triples
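The mapping from MBCs to SPARQL triple patterns can be sketched as follows. This is an illustrative assumption about how such a mapping might look, not the thesis implementation: the function name `mbc_triple_patterns` and the `db:` prefix are hypothetical, and each property is bound to a variable of the same name.

```python
def mbc_triple_patterns(class_name, properties, instance_filters=None):
    """Build SPARQL triple patterns for one class and its properties."""
    patterns = []
    for prop in properties:
        # Subject = class variable, predicate = property, object = new variable.
        patterns.append(f"?{class_name} db:{prop} ?{prop} .")
    for prop, value in (instance_filters or {}).items():
        # A mentioned row value (instance) adds a FILTER clause.
        patterns.append(f'FILTER (?{prop} = "{value}")')
    return patterns


# MBCs from the example query: customer (class), phone number and ID (properties).
patterns = mbc_triple_patterns(
    "customer", ["phone_number", "id_number"], {"id_number": "1"}
)
for line in patterns:
    print(line)
```

Two triples and one FILTER are produced, one triple per property of the single class mentioned in the query.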

Finding 9: What type of relationship exists between the formed SPARQL queries and RDF?
• Example of a full SPARQL query
[Figure: full SPARQL query, with the variable triple highlighted in blue/green, shown against the OWL RDF ontology derived from Microsoft's 'Northwind' DB; the variable completes the triple instance]
• Conclusion: the formed SPARQL and RDF are both based on triples and can therefore be mapped directly to yield answers

Findings 10: Word Length of Concepts
Conclusion:
• The optimal phrase length is 3 words
• It indicates the number of words that typically express most concepts (e.g. collocations) and therefore guides any rule-based concept discovery process

Survey on DB Schema Concepts Reconstruction
• Challenge: no common nomenclature exists, hence challenges in decoding schema information
• Research: answers the following questions based on collected data:
  1. Is there a finite set of patterns that database schema authors use in representing database schema object names?
  2. How can we decipher the meaning of an 'intended concept' from the schema name?
  3. How can a general 'Concepts Re-construction Algorithm' be built from an ontology created from a relational database source?

The Case Study & Some Findings
Case study data collection:
• Questionnaires
• Internet-based sources
Sample frame:
• 12 training institutions
• 16 software development firms
• 320 randomly sampled DB schema objects
Sampling: snowball
Analysis:
• Descriptive methods
• Pattern identification
End product:
• Generic algorithm for lexicon and concepts reconstruction

Conclusions in the DB Nomenclature Studies
• CONCLUSION 1: Finite set of patterns?
  – 10 common naming formalisms, categorized into 3 clusters (high, medium and low usage): under_score; Abbrev.; do.t; Finger_Breaking; da-sh; 'string'; PascalMethod; Acronyms; camelCase; SCREAM
• CONCLUSION 2: Deciphering the meaning of an 'intended concept'?
  – DB developers rarely give names that do not have meaning.
  – These meanings highly correlate with the intended concept.
• CONCLUSION 3: 'Concepts Re-construction Algorithm'?
  – A general Words Recreation Algorithm can be defined; it recreates words from an ontology derived from a DB.

The OWoRA Algorithm (Ontology Words Recreation Algorithm)
• A function retrieves schema elements from the ontology and forms a list
• Different functions handle the various patterns found in the strings
• Split compound strings and stem
• Identify lexicon and synonyms, and form associated concepts (phrase chunks + other categories)
More details of the algorithm are found in the thesis document.
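The string-splitting step can be sketched in Python. This is not the OWoRA algorithm itself, whose details are in the thesis; it is a minimal illustration, under the assumption that separator-based names (under_score, do.t, da-sh) and case-based names (camelCase, PascalMethod, SCREAM) are split with regular expressions. The function name `split_schema_name` is hypothetical, and abbreviation expansion, stemming and synonym lookup are left out.

```python
import re

# Runs of capitals (acronyms), capitalised or lower-case words, and digits.
WORD_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_schema_name(name):
    """Recover lower-case word candidates from a schema object name."""
    words = []
    # First split on explicit separators: underscore, dot, dash, whitespace.
    for part in re.split(r"[_.\-\s]+", name):
        # Then split each part on case boundaries.
        words.extend(WORD_PATTERN.findall(part))
    return [w.lower() for w in words]


examples = ["ProductID", "company_name", "unitPrice", "ORDER-DETAILS", "supplier.id"]
for schema_name in examples:
    print(schema_name, "->", split_schema_name(schema_name))
```

The recovered words would then be passed to stemming and lexicon lookup, which this sketch does not attempt.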

Evaluation of the OWoRA Algorithm
• Aim: experimentally determine the efficacy of the OWoRA algorithm
• OWoRA was run against 6 test databases
  – Measure: number of columns positively identified (lexically and semantically), expressed as a % of the total number of columns
[Table: results for the 6 test databases]
• CONCLUSION: results show a mean accuracy of 92.5%

Components Design & Construction
1. Structure of Gazetteer Model
[Figure: Gazetteer structure, annotated: populated from OWoRA; translated (need for manual translation); items to be matched to FSM]

2. Feature Space Model (FSM) Design: Experimental Investigation. Stem OR Lemmatize?
Example of lemmas (roots) and stems:
• Surface form: Kuku wakitetemeka ni wagonjwa?
• Lexical form: [kuku] [tetema] [ni] [mgonjwa] ([are] [chicken] [shiver] [sick])
• Stemmed form: [kuku] [tetem-] [ni] [-gonj-] {prefixes: wa-ki- for tetem-; -eka for tetem-}
Methodology: simulations on test bed
  – English: Lancaster stemmer (Paice, 1990) vs NLTK WordNet lemmatizer (Miller, 1995)
  – Kiswahili: lexical DB constructed from the TUKI dictionary
  – Results show that STEMMING results in a higher recall value
  – DECISION: store stemmed word forms

WHAT TO STORE in FSM? Bag of Words (BoW) OR CP (Concept Patterns) + BoW?
Methodology: simulations of various configurations on test bed
[Figure: results for the configurations]
Summary: store all as stemmed CP + BoW
1. Nouns and noun phrases, identified through patterns, e.g. the Kiswahili noun patterns by Ohly (1982): nominalized verbs, deverbative head with noun complement, combination of nouns, noun and adjectives, and nouns with –a connector
2. Terms and collocations, identified through patterns, e.g. Kiswahili term patterns (Sewangi, 2001)
3. Other phrase chunks: verb phrases, prepositional phrases
4. Synonyms, hypernyms

Structure of the Feature Space Model (FSM)
[Figure: FSM structure, annotated: items to be matched to Gazetteer]

3. Matching Function in OCM
REVIEWED STRATEGIES
From ontology matching (Keshavarz & Lee, 2012):
1. Lexical-based strategy
2. Semantic-augmentation strategy
3. Constraint-based strategy
4. Instance-based strategy
5. Structure-based matching models
From document retrieval strategies (Liddy, 2005):
1. Boolean
2. Vector space
3. Probabilistic
4. Language models
• The semantic augmentation model, borrowed from ontology-matching strategies, was used:
  – Augmentation = integrating the lexicon with the meaning of words, taken from lexical DBs
  – Developed the Semantically Augmented Concepts Mapping (SACoMa) function
  – The Levenshtein algorithm (lexicon-based, edit-distance calculation) was enhanced through semantic augmentation

A Python Implementation of Matching Function
[Figure: code listing of the matching function]
• The function maps concepts in the FSM to those in the Gazetteer
• More details of the algorithm are found in the thesis document
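Since the slide's listing is not reproduced in this text, the idea of an edit-distance matcher with semantic augmentation can be sketched as below. This is not the SACoMa function from the thesis: the names `levenshtein` and `match_concept`, the toy synonym dictionary standing in for a lexical DB, and the threshold parameter `mu` are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_concept(term, gazetteer, synonyms, mu=1):
    """Match a query term, or any of its synonyms, to gazetteer entries."""
    # Semantic augmentation: widen the term with its known synonyms.
    candidates = {term} | set(synonyms.get(term, ()))
    return sorted(
        entry for entry in gazetteer
        if any(levenshtein(c, entry) <= mu for c in candidates)
    )


# Hypothetical gazetteer entries and synonym table.
gazetteer = ["customer", "product"]
synonyms = {"client": ["customer"]}
print(match_concept("client", gazetteer, synonyms, mu=0))
print(match_concept("custmer", gazetteer, {}, mu=1))
```

With `mu=0` only exact matches (after augmentation) are accepted; with `mu=1` a single-character slip such as "custmer" still reaches "customer".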

4. Heuristics Examples
1. Handling foreign keys: for two or more tables related via a foreign key, a heuristic was developed from the collected data. It is stated as follows: "when 2 or more classes are involved in reply to a query, we introduce a triple from each participating class. The triple introduced must have the common property that originally constituted the foreign key".
2. Handling implicit concepts: this is the discovery of IMPLIED ontology concepts.
   • Heuristic: "If an instance is mentioned, its property is implied."
   • Example: in the sentence "Which products come in bottles", "bottle" is an instance of the "Categories" table through the property "Description"; thus, even if "Description" is not in the original sentence, it is an implied concept.
Validation: validated by analyzing 20 relevant queries for each heuristic
Observation: the heuristics were found to hold true in all cases (results in appendix 11)
Conclusion: the heuristics are dependable and were therefore implemented
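Both heuristics can be sketched in a few lines of Python. The function names `foreign_key_triples` and `implied_property`, the `db:` prefix and the hand-built instance index are hypothetical; the sketch only illustrates the two rules as stated on the slide.

```python
def foreign_key_triples(classes, shared_property):
    """Foreign-key heuristic: one triple per participating class,
    all bound to the common property that formed the foreign key."""
    return [f"?{cls} db:{shared_property} ?{shared_property} ." for cls in classes]


def implied_property(term, instance_index):
    """Implicit-concept heuristic: a mentioned instance implies its property.

    Returns a (class, property) pair, or None if the term is not a known instance.
    """
    return instance_index.get(term.lower())


# Hypothetical index from instance values to (class, property).
instance_index = {"bottle": ("categories", "Description")}

join = foreign_key_triples(["products", "categories"], "CategoryID")
implied = implied_property("Bottle", instance_index)
print(join)
print(implied)
```

Because both join triples bind the same variable, the two classes are linked on the shared property without an explicit join clause.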

5. Structured Query Generation Process
Sample sentence: "Give me the names and identification of suppliers from central region?"
Step: identify MBCs and associated triples
• Meaning Bearing Components: Give (SELECT query); me (dropped); the (dropped); names; identification; suppliers; central; region
• Compound terms (concepts): Supplier+Id; Company+Name
• Triples formed (possible permutations):
Database (Class) | Field (Property) | Instance/Variable
Suppliers | SupplierID | No instance/var.
Suppliers | CompanyName | No instance/var.
Suppliers | Region | "central"

Structured Query Generation Algorithm
• General SPARQL query assembly: heaping procedure
  1. Predefined URI
  2. Identified property objects
  3. Identified triples
  4. Identified filters
• NB: FILTER is necessary for instantiating a property value; it is applied where there is direct mention of an instance and a property
• The generated SPARQL query
[Figure: the generated SPARQL query]
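The four-part assembly can be sketched as a small string builder. This is a hedged illustration rather than the thesis generator: the function name `assemble_sparql`, its parameters and the `http://example.org/northwind#` URI are placeholders chosen for the example.

```python
def assemble_sparql(prefix, uri, select_vars, triples, filters=(), distinct=False):
    """Heap the four identified parts into one SPARQL query string."""
    lines = [f"PREFIX {prefix}: <{uri}>"]                  # 1. predefined URI
    keyword = "SELECT DISTINCT" if distinct else "SELECT"
    lines.append(keyword + " " + " ".join(f"?{v}" for v in select_vars))  # 2. properties
    lines.append("WHERE {")
    for subject, prop, obj in triples:                     # 3. identified triples
        lines.append(f"  ?{subject} {prefix}:{prop} ?{obj} .")
    for var, value in filters:                             # 4. identified filters
        lines.append(f'  FILTER (?{var} = "{value}")')
    lines.append("}")
    return "\n".join(lines)


# Single-table request for the cities that employees come from.
query = assemble_sparql(
    prefix="db",
    uri="http://example.org/northwind#",  # placeholder URI
    select_vars=["employees", "City"],
    triples=[("employees", "City", "City")],
)
print(query)
```

Passing `filters=[("Region", "central")]` would add the FILTER line needed when a query mentions an instance value.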

PUTTING ALL COMPONENTS TOGETHER
• Published as a book chapter in Springer Lecture Notes in Computer Science (LNCS 2013) (Muchemi & Popowich, 2013)
[Figure: the OCM-based architectural model, Ontology-based NL Access to DBs (ONLAD)]

Overall OCM Algorithm
[Figure: algorithm stages: Knowledge Comprehension; Concepts Discovery; Query Assembly; Query Execution]

Examples of Parsed NLQs
• Sample translated queries, with the expected SPARQL results

Query 1: "Give me the cities where employees come from?" (Fig. 1: one-table example)
  PREFIX db: <http://www.owlontologies.com/NewNorthwind#>
  SELECT ?employees ?City
  WHERE { ?employees db:City ?City . }

Query 2: "Which products come in bottles?" (Fig. 2: two-table example; note the foreign key)
  PREFIX db: <http://www.owlontologies.com/NewNorthwind#>
  SELECT DISTINCT ?ProductID ?ProductName ?Description ?CategoryID
  WHERE {
    ?products db:ProductID ?ProductID .
    ?products db:ProductName ?ProductName .
    ?products db:CategoryID ?CategoryID .
    ?categories db:CategoryID ?CategoryID .
    ?categories db:Description ?Description .
    FILTER (?Description = "bottled")
  }
  Note: 2 triples, one from each participating class and both having a common property (field).

EVALUATION OF OCM MODEL
Evaluation process
• The 5 query sets previously described were used
• No separation of questions: gives a true reflection of expected performance
• Evaluation framework: a novel evaluation framework developed from literature analysis
[Diagram: research stages, from NLQ Concepts Modeling through Prototype Development to Evaluation & Benchmarking]

The Test-bed Overview
TOOLS & RESOURCES
1. Tri-gram language detector
2. NLTK normaliser and tokenizer
3. Google Translate
Concepts generation:
1. Phrase chunker: NLTK RegExp (English)
2. RegExp chunker + Swahili patterns (Sewangi, 2001)
Stemming and lemmatizing:
1. Lancaster stemmer/lemmatizer (English)
2. Lexical DB (Swahili) developed from TUKI, SALAMA
POS tagging:
1. Combined trained unigram and trained Brill tagger
Approach: rapid prototyping; simulations in test bed

A Detailed Look at the OCM-Based Prototype
[Figure: OCM prototype overview]
Resources (partly garbled in extraction): Protégé, DataMaster, Protégé's native RDF reasoner, Wamp Server, Python scripting

Databases Used for Evaluation
No. | Name of database | No. of tables | Description | No. of queries
1 | Chicken Farmers_db | 8 | DB created to mimic the one at the Thika poultry farmers' project; also reported in Muchemi (2008) |
2 | UoN MSc Coordinator_db | 4 | DB created to mimic students' management at SCI, University of Nairobi | 200
3 | Microsoft's Northwind_db | 8 | Standard database shipped with Microsoft's database server | 120
4 | Restaurants_db | 7 | DB whose schema is described in Tang & Mooney (2001) and has been quoted widely in experiments | 250
5 | Computer Jobs_db | 4 | DB whose schema is described in Tang & Mooney (2001) and has been quoted widely in experiments | 200
Sampled query sets used for evaluation
• Sampled from the same queries used in the NLQ case studies
• A stratified random sampling approach
• 8 strata based on complexity of queries, defined in Tablan et al. (2008) as the number of concepts per query
• Diversity of queries ensured by different types, e.g. 'where', 'when'

Experimental Determination of Mean Performance of OCM Model
• Procedure for TESTING
  – Subject the sampled queries to OCM
  – 4 research assistants were used to perform the tests
• Procedure for EVALUATING
  – Test and categorize results
  – 4 human evaluators examined and categorized the answers generated
  – The evaluators were recruited from undergraduate CS students at UoN
  – Answers were categorized as 'true positive', 'false positive' or 'neg' (no answer generated)
• Procedure repeated with TTM models for practical comparison
• Training: research assistants and evaluators were given basic training on handling input and output responses from the prototype

Parameters in the Evaluation Framework
• The evaluation framework has 8 aspects
• 4 quantitative measures:
  1. Precision
  2. Recall
  3. Accuracy
  4. F-score
• 4 qualitative measures:
  1. Domain independence
  2. Language independence
  3. Support for cross-linguality
  4. Effect of query complexity on the model
Note: the design of the evaluation framework and the choice of parameters were guided by the literature review; the framework constitutes a 'gold-standard tool' readily usable by other researchers.

Results: 1 Test Set out of the 10
• Test 1: model = OCM; Levenshtein gap µ = 0, then changed to µ = 1
  – i.e. perfect matching of strings between the gazetteer and the FSM, or an allowance of 1 insertion, deletion or substitution of a single character
• Test 2: repeat the tests above but change the model to TTM
• 10 test sets done in total
[Table: sample results, test sets 1 & 2, OCM & TTM, Kiswahili queries]

Summary of Results from the 10 Evaluation Sets
• Results indicate a model whose:
  – Average precision at a Levenshtein distance µ of 1 is 0.75
  – This increases to 0.86 on decreasing µ to 0
• Accuracy marginally increases from 0.52 to 0.53 on decreasing µ

Effect of µ on Precision, Recall and F-Score
• Precision decreases with increase of µ while recall increases.
• F-score, the harmonic mean of precision and recall, remains stable at 0.72.
• It is true that "the higher the precision, the better the quality of the answers received"
  – Thus, based on precision alone, µ should be restricted to 0
• Recall shows the range of questions handled: "the higher the recall, the better the range"
  – Thus, based on recall, µ should be set to 1
[Plot: precision, recall and F-score against Levenshtein distance (within the matching function)]
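The harmonic-mean definition of the F-score quoted on this slide can be written out directly. The function name `f_score` is chosen for illustration, and the input values below are illustrative rather than a reproduction of the per-test-set calculations behind the reported figures.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs: a precision of 0.75 and a recall of 0.70.
example = f_score(0.75, 0.70)
print(round(example, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a gain in recall can offset a loss in precision, which is one way an F-score can stay stable while µ changes.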

Experimental Determination of Domain Independence
• Make the querying language the same as the schema language (no cross-linguality): English
• Test the 4 domains one at a time:
  1. Trading
  2. Job search
  3. Student management
  4. Finding restaurants
PROCEDURE
• Determine tp, fp and neg.
• For each domain calculate and tabulate accuracy, recall, precision and F-score
• Across the 4 domains calculate the mean (χ), variance (ύ) and standard deviation (σ)
• Perform outlier analysis (Peirce Criterion (Ross, 2003))
  – If no more than a minority of the points are classified as outliers, we conclude that the model is NOT significantly affected by a change in domain, and is hence domain-independent.
[Figure: experimental procedure for domain independence experiments]

Domain-Independence Analysis
[Table: |x − mean| per measure and domain]
• Apply the Peirce Criterion (Ross, 2003) on all points > 1.00
• The parameter R was obtained from Peirce's table for a four-data-point, one-outlier condition: R = 1.383
• Determine S, the maximum allowable deviation: S = R × σ = 1.383 × 0.021 = 0.02904 (e.g. first row)
• Determine Rmax, the largest observed deviation |xi − xm|max, for each row above
• If Rmax > S, the data point is classified as an outlier; otherwise it is normal
• No data was found to be an outlier, HENCE the conclusion that the model is DOMAIN-INDEPENDENT
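The outlier check can be sketched in Python under the usual reading of the Peirce criterion, in which a point is rejected when its deviation from the mean exceeds R × σ. The function names, the use of the sample standard deviation and the two toy data rows are assumptions; only R = 1.383 and σ = 0.021 come from the slide.

```python
from statistics import mean, stdev


def peirce_threshold(r, sigma):
    """Maximum allowable deviation from the mean: R times sigma."""
    return r * sigma


def flag_outliers(values, r):
    """Return the values whose deviation from the mean exceeds R * sigma."""
    centre = mean(values)
    limit = peirce_threshold(r, stdev(values))  # sample standard deviation (assumed)
    return [v for v in values if abs(v - centre) > limit]


# Threshold for the slide's first row: R = 1.383, sigma = 0.021.
threshold = peirce_threshold(1.383, 0.021)

# Hypothetical scores for four domains.
steady = flag_outliers([0.70, 0.72, 0.71, 0.73], r=1.383)
skewed = flag_outliers([0.70, 0.70, 0.70, 0.95], r=1.383)
print(threshold, steady, skewed)
```

An empty result for every measure corresponds to the slide's conclusion that no domain behaves as an outlier.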

Experimental Determination of Language Independence
• Evaluations done for English and Kiswahili
• Set the querying language to be the SAME as the schema language
PROCEDURE
• For each LANGUAGE determine true positives, false positives and no results (neg).
• Calculate and tabulate the following:
  – Accuracy, recall, precision
  – Mean (χ), variance (ύ) and standard deviation (σ)
• Analysis done:
  – Deviation analysis, as described in the previous experiment
  – Outlier analysis (Peirce Criterion (Ross, 2003)), as described
• No data was found to be an outlier, HENCE the conclusion that the model is LANGUAGE-INDEPENDENT
[Figure: procedure for language independence experiments]

Experimental Determination of Cross-lingual Support
• 4 experiments done:
  – Swahili queries, Swahili DB
  – English queries, English DB
  – Swahili queries, English DB
PROCEDURE
• Set up each of the 4 arrangements.
  – Determine true positives, false positives and no results.
• For each arrangement calculate and tabulate:
  – Accuracy, recall, precision
  – Mean (χ), variance (ύ) and standard deviation (σ)
• Analysis done:
  – Deviation analysis, as described in the previous experiment
  – Outlier analysis (Peirce Criterion (Ross, 2003)), as described
• No data was found to be an outlier, HENCE the conclusion that the model HAS GOOD SUPPORT FOR CROSS-LINGUAL QUERYING
[Figure: experimental procedure for cross-lingual support experiments]

Effect of Query Complexity
[Figure: Performance (y-axis) against Query Complexity (x-axis)]
Conclusion: The model performs best with at least 2 concepts per query; performance peaks at 3 to 5 concepts and then gradually degrades.

Comparative Analysis with other Models
• Benchmarking done through comparison with other published works.
[Table: comparison with published models; legend: Best in Category]

COMPARATIVE ANALYSIS …/2
OCM: P = 0.86; R = 0.70; A = 0.53; F = 0.72
Machine Learning
• M/L models require 2 back-to-back learners.
• Superimposing an SQL converter, e.g. Giordani & Moschitti (2010), F-score = 0.759, onto, say, WASP (F-score = 0.81), the overall DB-access F-score would be (0.81 × 0.759) = 0.615, which is lower than OCM's 0.72.
Logic Mapping
• PRECISE achieves an F-score of 0.65 (NO query pre-selection), which is lower than OCM's 0.72.
• PRECISE uses bag-of-words (BoW), whereas OCM uses 'Concepts' (tokens, phrase chunks, terms and collocations), thus explaining the better Recall for OCM (0.70 compared to 0.55, without query pre-selection).
Ontology Access
• Direct comparison is not suitable because the tasks are different: Querix accesses specific ontologies (GATE) while OCM is a generic RDB access model.
• In general, OCM has better Precision than Querix (0.86 compared to 0.78).
• However, Querix has a user-feedback intervention which assists in guiding the questions posed by the user. This explains Querix's better Recall (0.78 compared to 0.70).
• In the absence of this intervention, performance would be the same as Questio's (0.68) because of similar linguistic processing (BoW).
• OCM's good performance can be attributed to its different handling of query linguistics (Concepts Modeling).
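The machine-learning estimate on this slide multiplies the F-scores of two chained stages. The arithmetic can be sketched as below, assuming, as the slide does, that the scores of back-to-back stages combine multiplicatively; the function name `chained_f_score` is hypothetical and the three constants are the figures quoted on the slide.

```python
from math import prod


def chained_f_score(*stage_scores):
    """Estimate the end-to-end score of back-to-back stages as a product.

    Assumption (taken from the slide): stage scores combine multiplicatively.
    """
    return prod(stage_scores)


WASP_F = 0.81            # semantic parser F-score quoted on the slide
SQL_CONVERTER_F = 0.759  # Giordani & Moschitti (2010) F-score quoted on the slide
OCM_F = 0.72             # OCM F-score quoted on the slide

overall = chained_f_score(WASP_F, SQL_CONVERTER_F)
print(f"Chained M/L estimate: {overall:.3f}")
print(f"Lower than OCM: {overall < OCM_F}")
```

The product is only a rough estimate: it treats the two stages as independent, which the slide does not establish.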

Conclusions
• The developed models are language- and domain-independent (shown experimentally).
• Reason: the underlying concepts are based on universal language processing theories such as:
– Generative-transformation,
– Phrases, Terms & Collocations formation theories,
– the MBC Identification model (developed here).
• The main point of departure of the OCM (in terms of linguistic processing) from other models in 'Ontology-based solutions' is in 'Concepts Formulation'.
• "The good results therefore indicate that the use of concepts, arising from a concepts-modeling process, as opposed to bag-of-words, leads to better performance, as shown."
• Drawback: OCM requires someone to enter information that at times is regarded as obvious or superfluous. This leads to lower recall.

THEORETICAL CONTRIBUTIONS
1. New approach, "The OCM Approach": facilitates conversion of NLQ into structured queries (SPARQL).
2. "Semantics Transfer Model" (QuSeT): models semantic transfer.
3. Design of a generic algorithm, the OWoRA: models reconstruction of ontology words.
4. Framework for reviewing the trends in NL DB access approaches.
5. Extension of ideas postulated by Chomsky (1957): "DSF of a query can be used in deducing the interrogative properties of a NL query and that this property is domain and language independent."

TECHNICAL CONTRIBUTIONS
1. Architecture for Ontology-based NL Access to DBs (ONLAD)
2. Creation of 2 standard reusable research DATASETS (queries & databases)
– Kiswahili dataset: (Farming)
– English dataset: (Students' queries management domain)
3. Implementation of theoretical principles into practical contributions
– Implementation of the Kiswahili Terms & Collocations discovery methodologies of Sewangi (2001) into a concrete practical contribution
4. OCM Components & Related Heuristics
– Heuristics for discovery of implicit concepts
– Semantically-Augmented Concept Matching (SACoMA) function
– Heuristic for handling foreign keys
– Heuristic for SPARQL query generation
– Feature Space Model
– Gazetteer Model

3. METHODOLOGICAL CONTRIBUTIONS
1. Framework for Performance Evaluation: "The 8-parameter evaluation framework"
2. Procedures for Evaluation of Qualitative Parameters:
– Domain Independence, Language Independence, Cross-lingual Querying Capacity, Effect of Query Complexity
4. Achievements on Performance Advancement
a. Good performance values comparable to the state of the art
b. Attainment of Domain Independence
c. Attainment of Language Independence
d. Achievement of Cross-lingual Querying
Recommendations for Further Work
1. Scalability Study to multiple databases
2. Discourse Processing Study
3. Application of OCM to Object-Oriented Databases

Relevant Publications, Conferences & Projects
BOOK CHAPTERS
Muchemi, L. & Popowich, F. (2013). An Ontology-Based Architecture for Natural Language Access to Relational Databases. Springer Lecture Notes in Computer Science, Vol. 8009. HCI (6) 2013: 490-499. Las Vegas, USA. ISBN 978-3-642-39188-0
JOURNAL PUBLICATION
CONFERENCE PROCEEDINGS
Muchemi, L. & Popowich, F. (2013). NL Access to Relational Databases: The OCM Approach. Proceedings of the 7th International Conference, UAHCI 2013, Las Vegas, NV, USA, July 21-26, 2013, Proceedings Part I.
Muchemi, L. (2008). Swahili NL Access to RDbs (TTM Approach). Proceedings of the 4th ICCR Conference. Makerere Univ., Kampala, Uganda, August 2008.
Muchemi, L. & Getao, K. (2007). Enhancing Citizen-Government Communication Through Natural Language Querying. Proceedings of the 1st International Conference in Computer Science and Informatics (COSCIT 2007): 150-154. Nairobi, Kenya.
UoN MSc & BSc STUDENTS' PROJECTS APPLYING CONCEPTS DEVELOPED IN THIS WORK
1. Kiilu, E. (2014). NL Access to Kenya Open Data. BSc Project Report, UoN, Kenya.
2. Kihuna, M. (2013). Accessing Wikipedia Data Using NL. BSc Project Report, UoN, Kenya.
3. Ikunyua, E. (2012). Automatic Characterization of Named Entities in Structured Reports. MSc Project Report, UoN, Kenya.

Acknowledgements
1. Thanks to almighty God for journey mercies. A Ph.D is a long journey with many meanders.
2. Supervisors
1. Dr. Wanjiku Ng'ang'a, UoN
– Good, high-quality supervision
2. Prof. Fred Popowich, Simon Fraser University, Canada
– 6-month UNPAID Ph.D supervision in Canada
– Useful insights
3. Dr. Kate Getao
– Original Ph.D concept
3. Ph.D Examination Panel
– For providing me this opportunity to reach this crucial milestone
4. SCI admin & CBPS, led by Prof. Okelo-Odongo & Prof. Aduda
– For their support
5. Research Committee members, especially Prof. Waema, Prof. Omwenga, Prof. Waiganjo and Dr. Opiyo, among others
– For immeasurable motivation
6. Colleagues, friends & ALL present here for this viva.
THANK YOU ALL