Automatically Synthesizing SQL Queries from InputOutput Examples Sai

  • Slides: 47
Download presentation
Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work

Automatically Synthesizing SQL Queries from Input-Output Examples Sai Zhang University of Washington Joint work with: Yuyin Sun

Goal: making it easier for non-expert users to write correct SQL queries • Non-expert

Goal: making it easier for non-expert users to write correct SQL queries • Non-expert database end-users – Business analysts, scientists, marketing managers, etc. can describe what the query task is do not know how write a correct query This paper: bridge the gap! 2

An example Table: student Table: enrolled Output table name stu_id course_id score name MAX(score)

An example Table: student Table: enrolled Output table name stu_id course_id score name MAX(score) Alice 1 1 504 100 Alice 100 Bob 2 1 505 99 Charlie 88 Charlie 3 2 504 96 Dan 3 501 60 3 502 88 3 505 68 4 Find the name and the maximum course score of each student enrolled in more than 1 course. The correct SQL query: SELECT name, MAX(score) FROM student, enrolled WHERE student. stu_id = enrolled. stu_id GROUP BY student. stu_id HAVING COUNT(enrolled. course_id) > 1

Existing solutions for querying a database • General programming languages + powerful − learning

Existing solutions for querying a database • General programming languages + powerful − learning barriers • GUI tools + easy to use − limited in customization and personalization − hard to discover desired features in complex GUIs 4

Our solution: programming by example Table: student Table: enrolled Output table name stu_id course_id

Our solution: programming by example Table: student Table: enrolled Output table name stu_id course_id score name MAX(score) Alice 1 1 504 100 Alice 100 Bob 2 1 505 99 Charlie 88 Charlie 3 2 504 96 Dan 3 501 60 3 502 88 3 505 68 4 SQLSynthesizer SELECT FROM WHERE GROUP BY HAVING name, MAX(score) student, enrolled student. stu_id = enrolled. stu_id student. stu_id COUNT(enrolled. course_id) > 1

How do end-users use SQLSynthesizer? SQL? Real, large database tables Desired output result SQLSynthesizer

How do end-users use SQLSynthesizer? SQL? Real, large database tables Desired output result SQLSynthesizer Small, representative Input-output examples 6

SQLSynthesizer’s advantages • Fully automated − Only requires input-output examples − No need of

SQLSynthesizer’s advantages • Fully automated − Only requires input-output examples − No need of annotations, hints, or specification of any form • Support a wide range of SQL queries − Beyond the “select-from-where” queries [Tran’ 09] 7

Comparison of solutions GUI tools Ease of Use SQLSynthesizer Programming languages Expressiveness

Comparison of solutions GUI tools Ease of Use SQLSynthesizer Programming languages Expressiveness

Outline • • • Motivation A SQL Subset Synthesis Approach Evaluation Related Work Conclusion

Outline • • • Motivation A SQL Subset Synthesis Approach Evaluation Related Work Conclusion 9

Designing a SQL subset • 1000+ pages specification • PSPace-Completeness [Sarma’ 10] • Some

Designing a SQL subset • 1000+ pages specification • PSPace-Completeness [Sarma’ 10] • Some features are rarely used The full SQL language A SQL Subset SQLSynthesizer’s focus: a widely-used SQL subset 10

How to design a SQL subset? • Previous approaches: – Decided by the paper

How to design a SQL subset? • Previous approaches: – Decided by the paper authors [Kandel’ 11] [Tran’ 09] The full SQL language ? A SQL Subset • Our approach: – Ask experienced IT professionals for the most widely-used SQL features 11

Our approach in designing a SQL subset 1. Online survey: eliciting design requirement −

Our approach in designing a SQL subset 1. Online survey: eliciting design requirement − Ask each participant to select 10 most widely-used SQL features − Got 12 responses 2. Designing the SQL subset Supported SQL features 1) 2) 3) 4) 5) SELECT. . FROM…WHERE Supported in the previous work [Tran’ 09] JOIN GROUP BY / HAVING Aggregators (e. g. , MAX, COUNT, SUM, etc) ORDER BY 3. Follow-up interview: obtaining feedback − Ask each participant to rate the sufficiency of the subset 0 5 Not sufficient at all Average rating: 4. 5 Completely sufficient

Our approach in designing a SQL subset 1. Online survey: eliciting design requirement −

Our approach in designing a SQL subset 1. Online survey: eliciting design requirement − Ask each participant to select 10 most widely-used SQL features − Got 12 respondents 1. Designing the SQL subset The SQL subset is enough to write most common queries. Supported SQL features - Supported in the previous work SELECT. . FROM…WHERE [Tran’ 09] JOIN GROUP BY / HAVING Aggregators (e. g. , MAX, COUNT, SUM, etc) ORDER BY 2. Follow-up interview: obtaining feedback − Ask each participant to rate the sufficiency of the subset 0 5 Not sufficient at all Average rating: 4. 5 Completely sufficient

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 14

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 14

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples Input tables

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples Input tables SQLSynthesizer Queries Output table 15

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples SQLSynthesizer Queries

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples SQLSynthesizer Queries A SQL query Input tables Output table 16

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples Combine Input

SQLSynthesizer Workflow Select the desired query, or provide more examples Input-Output Examples Combine Input tables SQLSynthesizer Filter Queries Project Output table 17

SQLSynthesizer Workflow A complete SQL: Select the desired query, or provide more examples SELECT

SQLSynthesizer Workflow A complete SQL: Select the desired query, or provide more examples SELECT name, MAX(score) Projection columns FROM student, enrolled Join condition WHERE student. stu_id = enrolled. stu_id GROUP BY student. stu_id Input-Output Queries Query condition SQLSynthesizer HAVING COUNT(enrolled. course_id) > 1 Examples Input tables Join condition Query condition Projection columns Combine Filter Project Output table 18

Multiple solutions Query 1 Query 2 Input tables Query 3 Output table … 19

Multiple solutions Query 1 Query 2 Input tables Query 3 Output table … 19

Multiple solutions Project Filter Combine Input tables Output table … … Computes all solutions,

Multiple solutions Project Filter Combine Input tables Output table … … Computes all solutions, ranks them, and shows them to the user. 20

Key techniques Join condition Query condition Projection columns Combine Filter Project Input tables Output

Key techniques Join condition Query condition Projection columns Combine Filter Project Input tables Output table 1. Combine: Exhaustive search over legal combinations (e. g. , cannot join columns with different types) Key A machine learning approach to infer query conditions 2. Filter: ion t u rib t n co 3. Project: Exhaustive search over legal columns (e. g. , cannot apply AVG to a string column) 21

Learning query conditions Cast as a rule learning problem: Finding rules that can perfectly

Learning query conditions Cast as a rule learning problem: Finding rules that can perfectly divide a search space into a positive part and a negative part Rows contained in the output table The rest of the rows All rows in the joined table 22

Search space: the joined table Table: student name stu_id Alice 1 Bob 2 Charlie

Search space: the joined table Table: student name stu_id Alice 1 Bob 2 Charlie 3 Dan 4 Table: enrolled The joined table Name stu_id course_id score 1 504 100 1 505 99 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 3 505 68 Alice Join on the stu_id column Alice (inferred in the Combine step) stu_id course_id score 1 504 100 1 505 99 2 504 96 3 501 60 3 502 88 3 505 68 Bob 23

Finding rules selecting rows contained in the output table The joined table name stu_id

Finding rules selecting rows contained in the output table The joined table name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 Output table name MAX(score) Alice 100 Charlie 88 24

Finding rules selecting rows containing the output table The joined table name stu_id course_id

Finding rules selecting rows containing the output table The joined table name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 25

Finding rules selecting rows containing the output table The joined table name stu_id course_id

Finding rules selecting rows containing the output table The joined table name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 No good rules! name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

Solution: computing additional features • Key idea: – Expand the search space with additional

Solution: computing additional features • Key idea: – Expand the search space with additional features • Enumerate all possibilities that a table can be aggregated • Precompute aggregation values as features additional features Suppose grouping it by stu_id MAX(score) SUM (score) COUNT (course_id) name stu_id course_id score Alice 1 504 100 199 2 Alice 1 505 99 100 199 2 Bob 2 504 96 96 96 1 Charlie 3 501 60 88 216 3 Charlie 3 502 88 88 216 3 Charlie 3 505 68 88 216 3 The joined table . . .

Finding rules without additional features The joined table name stu_id course_id score Alice 1

Finding rules without additional features The joined table name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Bob 2 504 96 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 No good rules! name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68 28

Finding rules with additional features The joined table after the table is grouped by

Finding rules with additional features The joined table after the table is grouped by stu_id name stu_id course_id score COUNT(course_id) MIN(score) Alice 1 504 100 2 99 Alice 1 505 99 2 99 Bob 2 504 96 1 96 Charlie 3 501 60 3 60 Charlie 3 502 88 3 60 Charlie 4 505 68 3 60 COUNT(course_id) > 1 (after groupping by stu_id) SELECT FROM WHERE name, MAX(score) student, enrolled student. stu_id = enrolled. stu_id GROUP BY student. stu_id HAVING COUNT(enrolled. course_id) > 1 … name stu_id course_id score Alice 1 504 100 Alice 1 505 99 Charlie 3 501 60 Charlie 3 502 88 Charlie 4 505 68

Ranking multiple SQL queries • Occam’s razor principle: rank simpler queries higher – A

Ranking multiple SQL queries • Occam’s razor principle: rank simpler queries higher – A simpler query is less likely to overfit the examples • Approximate a query’s complexity by its text length Input table: student name age score Alice 20 100 Bob 20 99 Charlie 30 99 Output table name Alice Bob Query 1: select name from student where age < 30 Query 2: select name from student where name = ‘Alice’ || name = ‘Bob’ 30

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 31

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 31

Research Questions • Success ratio in synthesizing SQL queries? • What is the tool

Research Questions • Success ratio in synthesizing SQL queries? • What is the tool time cost? • How much human effort is needed in writing examples? • Comparison to existing techniques. 32

Benchmarks • 23 SQL query related exercises from a classic textbook – All exercises

Benchmarks • 23 SQL query related exercises from a classic textbook – All exercises in chapters 5. 1 and 5. 2 • 5 forum questions – Can be answered by using standard SQL (Most questions are vendor-specific) – 2 questions contain example tables 33

Evaluation Procedure Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples

Evaluation Procedure Input-Output Examples SQLSynthesizer Queries Select the desired query, or provide more examples • Rank of the correct SQL query • Tool time cost • Manual cost – Example size, time cost, and the number of interaction rounds (All experiments are done by the second author) 34

Results: success ratio The correct query ranks 1 st in all succeeded questions Succeedquestions

Results: success ratio The correct query ranks 1 st in all succeeded questions Succeedquestions on 5 5 forum questions Fail on 8 questions 28 SQL questions Succeed on 20 questions Fail on 8 exercises 23 textbook exercises Require writing sub-queries, which are not supported in SQLSynthesizer Succeed on 15 exercises 35

Result: tool time cost • On average, 8 seconds per benchmark – Min: 1

Result: tool time cost • On average, 8 seconds per benchmark – Min: 1 second, max: 120 seconds – Roughly proportional to the #table and #column 36

Results: manual cost • Example size – 22 rows, on average (min: 8 rows,

Results: manual cost • Example size – 22 rows, on average (min: 8 rows, max: 52 rows) • Time cost in writing examples – 3. 6 minutes per benchmark, on average (min: 1 minute, max: 7 minutes) • Number of interaction rounds – 2. 3 rounds per benchmark, on average (min: 1 round, max: 5 rounds) 37

Comparison with an existing approach • Query-by-Output (QBO) [Tran’ 09] – Support simple “select-from-where”

Comparison with an existing approach • Query-by-Output (QBO) [Tran’ 09] – Support simple “select-from-where” queries – Use data values as machine learning features Table: student name age score Alice 20 100 Bob 20 99 Charlie 30 80 Output table select name from student where age < 30 select age, max(score) from student group by age name Alice Bob age MAX(score) 20 100 30 80 38

Query-by-Output vs. SQLSynthesizer Succeed on 2 questions 28 SQL questions Succeed on 20 questions

Query-by-Output vs. SQLSynthesizer Succeed on 2 questions 28 SQL questions Succeed on 20 questions Fail on 8 questions 28 SQL questions Fail on 26 questions Query-by-Output SQLSynthesizer - Many realistic SQL queries use aggregation features. - Users are unlikely to get stuck on simple “select-from-where” queries. 39

Experimental Conclusions • Good success ratio (71%) • Low tool time cost – 8

Experimental Conclusions • Good success ratio (71%) • Low tool time cost – 8 seconds on average • Reasonable manual cost – 3. 6 minutes on average – 2. 3 interaction rounds • Outperform an existing technique – Success ratio: QBO (7%) vs. SQLSynthesizer (71%) 40

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 41

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 41

Related Work • Reverse engineering SQL queries Query-by-Examples [Zloof’ 75] A new GUI with

Related Work • Reverse engineering SQL queries Query-by-Examples [Zloof’ 75] A new GUI with a domain-specific language to write queries Query-by-Output [Tran’ 09] Uses data values as features, and supports a small SQL subset. View definition Synthesis [Sarma’ 10] Theoretical analysis, and is limited to 1 input/output table. • Automated program synthesis PADS [Fisher’ 08], Wrangler [Kandel’ 11], Excel Macro [Harris’ 11], SQLShare [Howe’ 11], Snippet. Suggest [Khoussainova’ 11], SQL Inference from Java code [Cheung’ 13] − Targets different problems, or requires different input. − Inapplicable to SQL synthesis 42

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 43

Outline • • • Motivation Language Design Synthesis Approach Evaluation Related Work Conclusion 43

Contributions • A programming-by-example technique – Synthesize SQL queries from input-output examples – Core

Contributions • A programming-by-example technique – Synthesize SQL queries from input-output examples – Core idea: using machine learning to infer query conditions • Experiments that demonstrate its usefulness – Accurate and efficient • Inferred correct answers for 20 out of 28 SQL questions • 8 seconds for each question – Outperforms an existing technique • The SQLSynthesizer implementation http: //sqlsynthesizer. googlecode. com 44

[Backup Slides] 45

[Backup Slides] 45

The most widely-used SQL features 21 features SELECT. . . FROM. . GROUP BY

The most widely-used SQL features 21 features SELECT. . . FROM. . GROUP BY The standard select. . from. . where. . feature ORDER BY COUNT DISTINCT AVG Aggregation features SUM IN MAX Joining features MIN FULL JOIN INNER JOIN Existential features HAVING LEFT JOIN Value matching features BETWEEN LIKE NOT NULL UNION NOT EXIST RIGHT JOIN 0 2 4 6 8 10 12 Number of votes 46

Covered features Design a SQL subset Uncovered features SELECT. . . FROM. . Covered

Covered features Design a SQL subset Uncovered features SELECT. . . FROM. . Covered 15 features GROUP BY ORDER BY COUNT DISTINCT AVG SUM IN Sub-query MAX MIN FULL JOIN INNER JOIN HAVING Special joins LEFT JOIN BETWEEN LIKE NOT NULL Wildcard matching UNION NOT EXIST RIGHT JOIN 0 2 4 6 8 10 12 47