Converting Natural Language to SQL statements Andrew Bury

Converting Natural Language to SQL statements Andrew Bury & Yunzhe Liu Directed by Dr. David Lin

“If we are to satisfy the needs of casual users of data bases, we must break through the barriers that presently prevent these users from freely employing their native languages. ” ~Ted Codd [1974] Codd (the inventor of the relational database) when speaking on a dialog based application he envisioned to allow for everyday users to interact and communicate with databases SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

SCHOOL OF ENGINEERING AND COMPUTER SCIENCE Why Is It Important? Enable end user to directly interact with databases

Related Fields Helps on related fields – Application Database Free Design Question Answering SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Related Fields Helps on related fields – Realtime access for information validation SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Core concepts • Relational Databases • SQL vs. Natural Language SQL Natural Language Mathematically Definable Ambiguous, multiple structures have similar meaning Hard to understand for Humans Comes naturally to Humans Interact with databases, programing language Interact with people Takes extended effort for humans to learn Is one of the beautiful aspects of humanity SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Core concepts Na. LIR: Li, Fei and H. V. Jagadish(2014) • Prompt user for clarification, and interactive system SQLizer: Yaghmazadeh, etc (2017) • Query sketch and refinement (idea used in part 2) Type. SQL: Yu, Tao, Zifan Li(2018) • A sequence to sequence model with an emphasis on data type (idea used for part 1) BERT: Devlin, Jacob (2018). • Self attention with LSTM (idea used for part 1) SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

SCHOOL OF ENGINEERING AND COMPUTER SCIENCE Na. LIR: an interactive system

SCHOOL OF ENGINEERING AND COMPUTER SCIENCE SQLizer: Query sketch and refinement

SCHOOL OF ENGINEERING AND COMPUTER SCIENCE BERT & TYPESQL: Neural Network with attention

Key Problems • How can we map user intent to actual clauses in SQL? • How can map the clauses in SQL to actual attributes in a database schema? • Once we have a collection of possible attributes from the schema, how can we restructure those into an actual SQL statement? SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Key Observation There seem to be parts of English questions that hint toward certain clauses in SQL statements, without any prior knowledge of a pre-existing database English: Who are all the people that live in Waco and are over 25 years old? SQL: SELECT people FROM Residents WHERE City = ‘Waco’ AND Age > 25 If there were a way to automatically detect what parts of an English question correlation to the corresponding clause in SQL, translation becomes much less difficult. SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

English: List the total number of horses on farms in Texas with ascending order. What is the total horses for each farm in Texas, sorted ascending? What are the amount of all of the horses on Texas farms, sorted alphabetically? SQL: SELECT Total_Horses FROM farm WHERE state = ‘TX’ ORDER BY Total_Horses ASC SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

This is Our System: Squire

Introducing Squire a novel approach to bridging the gap between natural language and SQL 2 step approach: • Step 1. Convert the English question into an Intermediate SQL language that is similar to SQL but with no prior database knowledge – Some key features of the English statement hint towards what a user expects from a database – Method to distinguish what the user expects from variables of output vs conditions that must be met about said variables • Step 2. Parse the Intermediate SQL language to SQL – Use similarity between words – use attributes in the database SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Example of Intermediate SQL Language English: What are the number of working horses on farms with more than 5000 total number of horses? Intermediate SQL Language: Select number of working horses, where farms, total number of horses > 5000 SQL: SELECT COUNT(workhorse) FROM local_farms WHERE num_horses > 5000 SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Part I: convert natural language to an Intermediate SQL language • Define a new language using English words but in a better structure that a programmatic parser can understand – Eliminate ambiguity, redundant words, find the target and conditionals • Manually label a large dataset to Intermediate SQL language • Use a similar method that has been implemented by others for Language to Language translation (Example: English translation to Chinese) SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Grammar of Intermediate SQL Language S’ -> S S -> <Select> <Where> Select -> S <Select> S -> <Noun> of <Noun_Indicator> S -> <Aggregation> <Noun> S -> <Noun> in <Noun_Indicator> SCHOOL OF ENGINEERING AND COMPUTER SCIENCE Where -> W <Where> W -> <Noun> W -> each <Noun> W -> <Noun_indicator> W -> W of W W -> W in W T -> descending T -> ascending …………

Part II: convert Intermediate SQL language to SQL statement using knowledge about the database • Parse Intermediate SQL language • Use query sketch with refinement (similar idea as SQLizer) – Use information from attributes in the database • It should be noted that we use vectors to calculate how similar the word in the English question is to a column name in the database the closer the vectors are, the more likely a word corresponds to a certain column in the database SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Word 2 vec: trying to understand word from a vector perspective King – Man = Queen – Woman These Vectors should be somewhat equivalent SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Data Sets • Wiki. SQL: A large crowd-sourced dataset for developing natural language interfaces for relational databases. – Large – Relatively easy – Zhong, Victor, Caiming Xiong and Richard Socher. “Seq 2 SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. ” Ar. Xiv abs/1709. 00103 (2018): n. pag. • Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task – About 7000 data – 3 difficulty levels, queries can be very challenging Yu, Tao, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang and Dragomir R. Radev. “Spider: A Large. Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. ” EMNLP (2018). SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Test Experiment In order to test the accuracy of the Intermediate Language that we have created, we came up with a set of tests on the Spider dataset to assess how accurately we can map the Intermediate Language to SQL. Example Queries that were tested had the following SQL clauses: - SELECT - WHERE - AGGREGATES - ORDER BY - FROM SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Results From Random Sample of 50 Intermediate SQL Statements Within First 3 Suggestions Within First 1 Suggestion Select Statements 72% 22% Aggregates 80% 58% Where Statements 52% 36% Order By 96% 90% From Statements 100% 74% SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Future Goals • Train a model to generate Intermediate SQL language – Our model was trained on a small dataset and because of so is overfit to what we trained on(if it comes across a sentence structure it has not seen it incorrectly labels the data) • Solution – Label all statements in the spider and Wiki. SQL datasets to create a near perfect translator (this takes a lot of effort) • Application in related fields – Question answering – Semantic parsers SCHOOL OF ENGINEERING AND COMPUTER SCIENCE

Questions? Andrew_Bury@baylor. edu Yunzhe_Liu@baylor. edu SCHOOL OF ENGINEERING AND COMPUTER SCIENCE