Dynamic Query Forms for Database Queries 1 1

Dynamic Query Forms for Database Queries *1 *1 *1 *2 Liang Tang, Yexi Jiang, Tao Li and Zhiyuan Chen School of Computer Science, Florida International University, Miami, Florida, USA *1 Department of Information Systems, University of Maryland Baltimore Country, Baltimore, Maryland, USA *2 IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 9, September 2014 2015/10/30 M 1 andy 1

Main Author Liang Tang • Ph. D. student • School of Computing & Information Sciences, Florida International University Research Interests • data and information analysis • • data mining machine learning information retrieval techniques user modeling and personalization in recommender systems, online advertising, multi-arm 2 bandit algorithms and event mining

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 3

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 4

Introduction Query Form • one of the most widely used user interfaces for querying databases • Traditional query forms are designed and predefined by developers or DBA 5

Introduction Query Form • Modern databases become very large and complex • with the rapid development of web information and scientific databases ➡ It is difficult to design static query forms to satisfy various ad-hoc database queries 6

Authors’ Approach In this paper, authors propose Dynamic Query Form system: DQF • query interface which is capable of dynamically generating query forms for users 7

Authors’ Approach In this paper, authors propose Dynamic Query Form system: DQF • query interface which is capable of dynamically generating query forms for users • The essence of DQF is to capture user interests during user interactions and to adapt the query form iteratively • Each iteration consists of two types of user interactions: Query Form Enrichment and Query Execution 8

Authors’ Approach Dynamic Query Form system: DQF ✓ ranking of query form components ✓ dynamic generation of query forms 9

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 10

Query Form Interface • Authors formally define the query form 11

Query Form query form F = (AF, RF, σF, ▹◃(RF)) ➡ F = (SELECT A 1, A 2, …, Ak FROM ▹◃(RF) WHERE σF) AF = {A 1, A 2, …, Ak} • k>0 k attributes for projection RF = {R 1, R 2, …, Rn} • n>0 the set of n relations (or entities) involved in the query ▹◃(RF) • a join function to generate a conjunction of expressions for joining relations of RF σF • a conjunction of expressions 12 for selections (or conditions) on

Query Form query form F = (AF, RF, σF, ▹◃(RF)) ➡ F = (SELECT A 1, A 2, …, Ak FROM ▹◃(RF) WHERE σF) • AF is the set of columns of the result table • σF is the set of input components for users to fill • RF and ▹◃(RF) are not visible in the user interface • • RF is determined by AF and σF ▹◃(RF) is automatically constructed according to the foreign keys among relations in RF 13

Query Form query form F = (AF, RF, σF, ▹◃(RF)) ➡ F = (SELECT A 1, A 2, …, Ak FROM ▹◃(RF) WHERE σF) • RF is the union set of relations which contains at least one attribute of AF or σF ✓ The components of query form F are actually determined by AF and σF 14

Query Results A user does not have time to check every data instance • to decide whether a query form is desired or not Many database queries output a huge amount of data instances ➡ In order avoid this “Many-Answer” problem • the system only output a compressed result table to show a high-level view of the query results first 15

Query Results • Each instance in the compressed table represents a cluster of actual data instances ✓ The user can click through interested clusters to view the detailed data instances 16

Query Results Incremental data clustering framework is chosen • to generate the compressed view efficiently • Clustering is just to provide a better view of the query results for the user 17

Query Results Another important usage of the compressed view • to collect the user feedback • The system can estimate the goodness of a query form by using the collected feedback ➡ It can recommend appropriate query form components ✓ The click-through on the compressed view table is an implicit feedback • to tell the system which cluster (or subset) of data instances is desired by the user 18

Query Results • Duf : the clicked subset Duf is only a subset of all user desired data instances ✓ Duf can help the system generate recommended form components • that help users discover more desired data instances 19

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 20

Ranking Metric Two traditional measures to evaluate the quality of the query results • • Precision Recall Query forms are able to produce different queries by different inputs Different queries can output different query results and achieve different precisions and recalls ➡ Expected precision and expected recall are used to 21

Ranking Metric Expected precision = expected proportion of the query results which are interested by the current user Expected recall = the expected proportion of user interested data instances which are returned by the current query form ✓ User interest • estimated based on the user’s click-through on query results 22

Ranking Metric • If some data instances are clicked by the user • • these data instances must have high user interests The query form components which can capture these data instances should be ranked higher than other components 23

Ranking Metric Symbols used in this paper 24

Ranking Metric Symbols used in this paper • P(d) : occurrence probability of d in D • P(σF | d) : the probability of “d satisfies σF” • P(σF | d) ∈ {0, 1} • • P(σF | d) = 1 if d is returned by F P(σF | d) = 0 otherwise 25

Ranking Metric Symbols used in this paper • • DAF : a projected database P(d. AF) : the probability of projected instance d. AF in the projected database DAF ✓ P(d. AF) may be greater than 1/N • • • there are often duplicated projected instances Pu(d) : the probability of d being desired by the user Pu(d. AF) : the probability of the user being interested in a projected instance 26

Ranking Metric Example • D = {I 1, I 2, …, I 5} • (N = 5) A = {C 1, C 2, …, C 5} • Q : “SELECT C 2, C 5 FROM D WHERE C 2 = b 1 OR C 2 = b 2” • • DQ = {I 1, I 2, I 3, I 4} P(σFi | d) • • • 1 for I 1 to I 4 0 for I 5 P(I 1 C 2, C 5) = 2/5 27

Ranking Metric Definition of two measures for query forms • expected precision and expected recall 28

Ranking Metric • The overall performance measure is derived • by considering both expected precision and expected recall ➡ F-Measure β : constant parameter to control the preference 29

Ranking Metric In authors’ system • A ranked list of query form components is provided for the user • The form components are ranked in descending order of FScore. E(Fi+1) • for maximizing the goodness of the next query form 30

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 31

Estimation of Ranking Score DQF provides a two-level ranked list for projection components 1. the ranked list of entities 2. the ranked list of attributes in the same entity 32

Ranking Projection Form Components • From the Definition 3, FScore. E(Fi+1) is obtained as follows: 33

Ranking Projection Form Components Pu(d. AFi+1) for the projection components • Aj+1 is the projection attribute we want to suggest for the Fi+1 • Pu(d. AFi+1) can be estimated by the user’s click-through on results of Fi • Duf ⊆ D : a set of data instances which are clicked by the user in previous query results 34

Ranking Projection Form Components • Kernel density estimation method is applied to estimate Pu(d. AFi) • Each db ∈ Duf represents a Gaussian distribution of the user’s interest 35

Ranking Projection Form Components ✓ Pu(d. Aj+1 | d. AF ) is not visible in the run-time data i • • d. Aj+1 has not been used before Fi+1 it can be estimated only from other data sources ➡ For estimating the conditional probability Pu(d. Aj+1 | d. AF ), i the following two data-driven approaches are considered • • Workload-Driven Approach Schema-Driven Approach 36

Ranking Projection Form Components Workload-Driven Approach • Pu(d. Aj+1 | d. AFi) could be estimated from query results of historic queries • If a lot of users queried attributes AFi and Aj+1 together on instance d, then Pu(d. Aj+1 | d. AFi) must be high Schema-Driven Approach • The database schema implies the relations of the attributes • If two attributes are contained by the same entity, then they are more relevant 37

Ranking Projection Form Components Each of the two approaches has its own drawback • The workload-driven approach has the cold-start problem • • it needs a large amount of queries The schema-driven approach is not able to identify the difference of the same entity’s attributes In authors’ system • similarity The two approaches are combined as follows: λ: weight parameter in [0, 1] 38

Ranking Projection Form Components Similarity between Aj+1 and AFi estimated from the database schema • d(Aj+1, A) : the schema distance between the attribute Aj+1 and A in the schema graph • dmax : the diameter of the schema graph In this paper • The schema graph are utilized to compute the relevance of two attributes 39

Ranking Projection Form Components Database schema graph : G = (R, FK, ξ, A) • • R : the set of nodes representing the relations A : the set of attributes FK : the set of edges representing the foreign keys ξ : A → R, an attribute labeling function to indicate which relation contains the attribute 40

Ranking Projection Form Components Ranking score of an entity • just the averaged FScore. E(Fi+1) of that entity’s attributes ✓ If one entity has many high score attributes, then it should have a higher rank 41

Ranking Selection Form Components • The selection attributes must be relevant to the current projected entities • Otherwise, selection would be meaningless ➡ The system should first find out the relevant attributes for creating the selection components 42

Ranking Selection Form Components • The relevance of attributes is measured based on the database schema as follows • Relevant attribute Ar(F) can be obtained in O(|Ar(F)| · t) • by using the depth-first traversing of the database schema graph 43 how compact of the schema is ✓ The choice of t depends on

Ranking Selection Form Components • For enriching selection form components of a query form • the set of projection components AF is fixed (AFi+1 = AFi) ✓ FScore. E(Fi+1) only depends on σFi+1 44

Ranking Selection Form Components • For the simplicity of the user interface • most query forms’ selection components are simple binary relations in the form of “Aj op cj” • • Aj : an attribute cj : a constant op : a relational operator (‘=’, ‘≥’, ‘≤’ etc. ) In each cycle • The system provides a ranked list of such binary relations for users to enrich the selection part • Only the best selection component for each attribute is selected 45

Ranking Selection Form Components • For attribute As • • σFi+1 = σFi ∪ {s} (As ∈ Ar(F)) (s ∈ σ and s contains As) In order to find the s ∈ σ that maximizes the FScore. E(Fi+1) • the system only need to estimate P(σFi+1 | d) for each data instance d ∈ D ✓ In authors’ system, σF represents a conjunctive expression 46

Ranking Selection Form Components • P(σFi | d) can be estimated by previous queries executed on query form Fi • P(s |σFi , d) : • • 1 if and only if d satisfies σFi and s 0 otherwise ➡ The only problem is to determine the space of s • the system have to be enumerate all the s to compute their score 47

Ranking Selection Form Components In this paper One-Query method is proposed • in order to efficiently estimate the new FScore induced by query condition s • • It sorts the values of an attribute in s and incrementally computes the FScore on all possible values for that attribute 48

Ranking Selection Form Components ✓ P(σFi+1 | d) depends on the previous query conditions σFi • If P(σFi | d) = 0, P(σFi+1 | d) must be 0 ➡ The system don’t need to retrieve all data instances in the database ✓ Only the set of data instances D’⊆ D such that each d ∈ D′satisfies P(σFi | d) > 0 is needed ➡ The selection of One-Query’s 49 query is the union of query

Ranking Selection Form Components One-Query algorithm • does not send each query condition s to the database engine to select data instances • It retrieves the set of data instances D’, and checks every data instance with every query condition by its own • It needs to know the values of all selection attributes of D’ ➡ One-Query adds all the selection attributes into the projections of the query 50

Ranking Selection Form Components • When the system receives the result of the query Qone, it calls the second algorithm of One-Query • to find the best query condition 51

Ranking Selection Form Components For any query conditions (“=”, “≤” etc. ) • Incremental approaches to compute their FScore can be found • Time complexity of finding the best query condition for an attribute is O(|DQone| · |AFi|) • Ranking every attribute’s selection component is O(|DQone| · |AFi| · |Ar(Fi)|) 52

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 53

Evaluation The goal of authors’ evaluation is to verify the following hypotheses H 1 : Is DQF more usable than existing approaches such as static query form and customized query form? H 2 : Is DQF more effective to rank projection and selection components than the baseline method and the random method? H 3 : Is DQF efficient to rank the recommended query form components in an online user interface? 54

Evaluation System prototype 55

Evaluation Data sets • 3 databases : NBA, Green Car and Geobase 56

Evaluation Authors compare three approaches to generate query forms • DQF: The dynamic query form system proposed in this paper • SQF: The static query form generation approach • • using query workload CQF: The customized query form generation used by many existing database clients • such as Microsoft Access, Easy. Query 57

Evaluation User Study • to evaluate the usability of authors’ approach • 20 participants of graduate students, UI designers, and software engineers 58

User Study two phases of User Study 1. query collection phase • Each participant uses the system to submit some queries and these queries were collected • Collected queries were used as query workload to train the system 2. testing phase • • Authors asked each participant to complete 12 tasks Each participant used all three form generation approaches to form queries 59

User Study 12 tasks of User Study 60

Evaluation Simulation Study • Authors used the collected queries in a large scale simulation study • Queries were partitioned into a training set and testing set by using a cross-validation approach • Authors evaluated the average performance for 61 testing sets

Evaluation Usability Metrics • metrics in Human-Computer Interaction and Software Quality for measuring the usability of a system 62

Evaluation User Study Result 63

Evaluation User Study Result 64

Evaluation User Study Result 65

Complexity Form Complexity Average Form Complexity DQF SQF NBA 8. 1 30 Green Car 4. 5 16 Geobase 6. 7 66

Effectiveness 67

Efficiency 68

Outline 1. Introduction 2. Query Form interface 3. Ranking Metric 4. Estimation of Ranking Score i) Projection Form Components ii) Selection Form Components 5. Evaluation 6. Conclusion and Future Work 69

Conclusion and Future Work • Authors propose a dynamic query form generation approach which helps users dynamically generate query forms • The key idea is to use a probabilistic model to rank form components based on user preferences • Experimental results show that the dynamic approach often leads to higher success rate and simpler query forms compared with a static approach • As future work, authors will study how their approach can be 70 extended to non relational data