A Sequential Pattern Query Language for Supporting Instant

  • Slides: 14
Download presentation
A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services Reza Sadri

A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services Reza Sadri sadri@procom. com Carlo Zaniolo zaniolo@cs. ucla. edu Amir Zarkesh Jafar Adibi azarkesh@u 4 cast. com jabibi@u 4 cast. com

Sequential Patterns in ECommerce Applications n Applications: n n n Targeted Advertising Instant analysis

Sequential Patterns in ECommerce Applications n Applications: n n n Targeted Advertising Instant analysis of stock market trends Mining web access logs Fraud detection Requirements: n n Expressive query language for finding complex patterns in database sequences Efficient and scalable implementation: Query Optimization

State of The Art n n n ADT (e. g. . Informix Datablades): Not

State of The Art n n n ADT (e. g. . Informix Datablades): Not flexible enough, no Optimization SEQ: Enhanced ADTs (e. g. sets and sequences) with their own query language SRQL: Adding sequence algebra operators to relational model

SQL-TS n A query language for finding complex patterns in sequences n n Minimal

SQL-TS n A query language for finding complex patterns in sequences n n Minimal extension of SQL—only the from clause affected A new Query optimization technique based on extensions of the Knuth, Morris & Pratt (KMP) string-search algorithm

Example in Mining Weblogs Consider a table: Sessions(Sess. No, Click. Time, Page. No, Page.

Example in Mining Weblogs Consider a table: Sessions(Sess. No, Click. Time, Page. No, Page. Type) That keeps track of pages visited in a session (sequence of requests from the same user) Possible page type: content product description purchase

Example in Mining Weblogs n SQL-TS queries to find the ideal scenarios: SELECT B.

Example in Mining Weblogs n SQL-TS queries to find the ideal scenarios: SELECT B. Page. No, C. Click. Time FROM Sessions CLUSTER BY Sess. No SEQUENCE BY Click. Time AS (A, B, C) WHERE A. Page. Type=‘content’ AND B. Page. Type=‘product’ AND C. Page. Type=‘purchase’

Example in Fraud Detection Consider a Table log that keeps track of credit card

Example in Fraud Detection Consider a Table log that keeps track of credit card transactions: Spending(Date, Account. No, Amount) A surge in average spending might be sign of credit card theft.

Example in Fraud Detection Track 30 -day average spending and when it increases considerably

Example in Fraud Detection Track 30 -day average spending and when it increases considerably for two consecutive days: Select Z. Account. No, Z. Date FROM Spending CLUSTER BY Account. No SEQUENCE BY Date AS (*X, Y, Z) WHERE COUNT(*X)=30 AND Y. Amount > 5 * AVG(*X. Amount) AND Z. Amount > 5 * AVG(*X. Amount) Notice the Use of star and aggregates.

Optimized string search: KMP Consider text array text and pattern array p: i text[i]

Optimized string search: KMP Consider text array text and pattern array p: i text[i] j pattern[j] 1 2 a b 3 4 5 6 7 8 9 10 11 a b c a 3 4 5 6 a b c a After failing, use the information acquired so to: - backtrack to shift(j), rather than i+1, and - only check pattern values after next(j) But in SQL-TS we have general predicates & star patterns

shift and next n n n Success for first j-1 elements of pattern. Failure

shift and next n n n Success for first j-1 elements of pattern. Failure for jth element (when input is at i) Any shift less than shift(j) is guaranteed to lead to failure, Match elements in the pattern starting at next(j) i–j+1 i – j + shift(j) + 1 i - j + shift(j) + next(j) i Input 1 shift(j) + next(j) j Pattern 1 next(j) j - shift(j) Shifted Pattern shift(j)

Optimal Pattern Search (OPS) Search path for naive algorithm vs. optimized algorithm:

Optimal Pattern Search (OPS) Search path for naive algorithm vs. optimized algorithm:

Optimizing Star Patterns Relaxed Double Bottom: n Only considering increases and decreases that are

Optimizing Star Patterns Relaxed Double Bottom: n Only considering increases and decreases that are more than 2% *U (less than 2% change) *T *V *Y *R *Z (less than 2% change) *W (less than 2% change)

Relaxed Double Bottom: Ninety fold improvement

Relaxed Double Bottom: Ninety fold improvement

Conclusions n n n Disjunctive queries, partial ordered domains, aggregates also treated in this

Conclusions n n n Disjunctive queries, partial ordered domains, aggregates also treated in this approach Old applications—more power & flexibility than Datablades ADTs of commercial DBMSs Ongoing implementation, by building on the user-defined aggregates supported in AXL.