Ranking Interesting Subgroups
Stefan Rüping, Fraunhofer IAIS
stefan.rueping@iais.fraunhofer.de
Fraunhofer Web Project, kick-off on 17 July 2008

Motivation
1. name_score >= 1 & geoscore >= 1 & housing >= 5   p = 41.6%
2. Income_score >= 5 & name_score >= 5 & housing >= 5   p = 36.0%
3. Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5   p = 43.8%
4. Families == 0 & name_score >= 1 & housing == 0   p = 28.9%
5. Financial_status == 0 & name_score >= 3 & housing <= 5   p = 66.1%
Motivation (continued)
§ Applying ranking to complex data: subgroup models
§ Optimization of data mining models for non-expert users
Overview
§ Introduction to Subgroup Discovery
§ Interesting Patterns
§ Ranking Subgroups
  • Representation
  • Ranking SVMs
  • Iterative algorithm
§ Experiments
§ Conclusions
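Subgroup Discovery
§ Input
  • Instance space $X$ defined by nominal attributes $A_1, \dots, A_d$
  • Data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in X$ and class labels $y_i$ (positive class: $y_i = 1$)
§ Subgroup language
  • Propositional formula $A_{i_1} = v_{j_1} \wedge A_{i_2} = v_{j_2} \wedge \dots$
§ For a subgroup $S$ let
  • $g(S) = \#\{x_i \in S\} / n$ (subgroup size)
  • $p(S) = \#\{x_i \in S \mid y_i = 1\} / (n \, g(S))$ (class probability), with $p_0 = \#\{i \mid y_i = 1\} / n$
  • $q(S) = g(S)^a \, (p(S) - p_0)$ (subgroup quality = significance of the pattern; $a = 0.5$ corresponds to a t-test)
§ Task
  • Find the $k$ subgroups with highest significance (maximal quality $q$)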
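The task above can be sketched as a brute-force search: score every conjunction of attribute tests with $q(S)$ and keep the best $k$. This is only an illustration under stated assumptions, not the search used in the talk. The function names, the `max_depth` limit, and the 8-row data set are hypothetical; the rows were chosen so that the counts match the ice cream example (4 of 8 rows with good weather, 2 of 8 advertised, 5 of 8 with high sales).

```python
from itertools import combinations, product

def quality(rows, labels, description, a=0.5):
    """q(S) = g(S)^a * (p(S) - p0) for a conjunction {attribute: value}."""
    n = len(rows)
    p0 = sum(labels) / n  # overall share of positive examples
    covered = [y for x, y in zip(rows, labels)
               if all(x[attr] == val for attr, val in description.items())]
    if not covered:
        return float("-inf")  # empty subgroup: p(S) is undefined
    g = len(covered) / n               # relative subgroup size
    p = sum(covered) / len(covered)    # share of positives inside S
    return g ** a * (p - p0)

def top_k_subgroups(rows, labels, k=3, max_depth=2, a=0.5):
    """Score all conjunctions of up to max_depth attribute tests."""
    attributes = sorted(rows[0])
    values = {A: sorted({x[A] for x in rows}) for A in attributes}
    scored = []
    for depth in range(1, max_depth + 1):
        for chosen in combinations(attributes, depth):
            for vals in product(*(values[A] for A in chosen)):
                description = dict(zip(chosen, vals))
                scored.append((quality(rows, labels, description, a),
                               description))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical data, not taken from the slides.
rows = (
    [{"weather": "good", "advertised": "yes"}]
    + [{"weather": "good", "advertised": "no"}] * 3
    + [{"weather": "bad", "advertised": "yes"}]
    + [{"weather": "bad", "advertised": "no"}] * 3
)
labels = [1, 1, 1, 1, 1, 0, 0, 0]  # 1 = high sales, 0 = low sales

best = top_k_subgroups(rows, labels, k=2)
```

Exhaustive enumeration is only feasible for very small description spaces; practical subgroup discovery prunes the search.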
Subgroup Discovery: Example

Weather   Advertised   Ice Cream Sales
good      yes          high
good      no           high
bad       no           low
bad       yes          high
bad       no           low
Subgroup Discovery: Example (continued)
§ S1: Weather = good → sales = high
  • $g(S_1) = 4/8$, $p(S_1) = 4/4$
  • $q(S_1) = (4/8)^{0.5} \, (4/4 - 5/8) \approx 0.265$
§ S2: Advertised = yes → sales = high
  • $g(S_2) = 2/8$, $p(S_2) = 2/2$
  • $q(S_2) = (2/8)^{0.5} \, (2/2 - 5/8) = 0.1875$
§ Significance ≠ Interestingness
Interesting Patterns
What makes a pattern interesting to the user? It depends on prior knowledge, but heuristics exist.
§ Attributes
  • Actionability
  • Acquaintedness
§ Sub-space
  • Novelty
§ Complexity
  • Not too complex
  • Not too simple
Overview: Ranking Interesting Subgroups
[Diagram with the components Data, Subgroup Discovery, Task Modification, Ranking SVM, and Subgroup Representation, plus user feedback of the form "S1 > S2"]
Subgroup Representation (1/3)
§ Subgroups become the examples of the ranking learner!
§ Notation
  • $A_i$ = original attribute
  • $r(S)$ = representation of subgroup $S$
§ Remember: important properties of subgroups
  • Attributes
  • Examples
  • Complexity
§ Representing complexity
  • $r(S)$ includes $g(S)$ and $p(S) - p_0$
Subgroup Representation (2/3): Representing attributes
§ For each attribute $A_i$ of the original examples, include a corresponding attribute in the subgroup representation
§ Observation: a TF/IDF-like representation performs even better
Subgroup Representation (3/3): Representing examples
§ The user may be more interested in a subset of the examples
§ Construct a list of known relevant and irrelevant subgroups from user feedback
§ For each subgroup $S$ and each known relevant/irrelevant subgroup $T$, define the relatedness of $S$ to the known subgroup $T$
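The three parts of the representation (complexity, attributes, examples) could be assembled into one feature vector roughly as follows. The slides do not preserve the exact formulas, so every concrete choice here is an assumption made for illustration: the complexity terms enter in log form, the attribute part is a 0/1 indicator of whether $A_i$ occurs in the description, and relatedness to a known subgroup $T$ is the Jaccard overlap of the covered examples. The function names and the toy data are hypothetical.

```python
import math

def cover_set(description, rows):
    """Indices of the examples matched by a conjunction {attribute: value}."""
    return {i for i, x in enumerate(rows)
            if all(x[attr] == val for attr, val in description.items())}

def represent(description, rows, labels, attributes, known_subgroups):
    """Feature vector r(S): complexity part, attribute part, example part."""
    n = len(rows)
    p0 = sum(labels) / n
    covered = cover_set(description, rows)
    if not covered:
        raise ValueError("empty subgroup")
    g = len(covered) / n
    p = sum(labels[i] for i in covered) / len(covered)
    if p <= p0:
        raise ValueError("this sketch only handles subgroups with p(S) > p0")
    # Assumption: complexity enters as log g(S) and log(p(S) - p0).
    complexity = [math.log(g), math.log(p - p0)]
    # Assumption: one 0/1 indicator per original attribute.
    attribute_part = [1.0 if A in description else 0.0 for A in attributes]
    # Assumption: relatedness = Jaccard overlap with each known subgroup.
    example_part = []
    for T in known_subgroups:
        other = cover_set(T, rows)
        union = covered | other
        example_part.append(len(covered & other) / len(union) if union else 0.0)
    return complexity + attribute_part + example_part

# Hypothetical toy data, not taken from the slides.
rows = (
    [{"weather": "good", "advertised": "yes"}]
    + [{"weather": "good", "advertised": "no"}] * 3
    + [{"weather": "bad", "advertised": "yes"}]
    + [{"weather": "bad", "advertised": "no"}] * 3
)
labels = [1, 1, 1, 1, 1, 0, 0, 0]

r = represent({"weather": "good"}, rows, labels,
              attributes=["advertised", "weather"],
              known_subgroups=[{"advertised": "yes"}])
```

The vector has one entry per complexity term, per original attribute, and per known subgroup, so it grows as the user gives more feedback.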
Ranking Optimization Problem (1/2)
§ Rationale
  • Subgroup discovery gives quality $q(S) = g(S)^a \, (p(S) - p_0)$
  • The user defines a ranking by pairs "S1 > S2" ($S_1$ is better than $S_2$)
  • Find the true ranking $q^*$ such that $S_1 > S_2 \Leftrightarrow q^*(S_1) > q^*(S_2)$
§ Assumption (justified by assuming hidden labels of interestingness of the examples)
§ Define a linear ranking function $\log q^*(S) = (a, 1, w) \cdot r(S)$
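To see why a linear function of $r(S)$ is a natural generalization of $q$, take the logarithm of the quality function. The derivation below assumes that the first two components of $r(S)$ are $\log g(S)$ and $\log(p(S) - p_0)$ and that $p(S) > p_0$; the slides only state that $r(S)$ includes $g(S)$ and $p(S) - p_0$, so the log form and the index $j$ over the remaining components are assumptions.

```latex
\begin{align*}
q(S) &= g(S)^{a}\,\bigl(p(S) - p_0\bigr) \\
\log q(S) &= a \,\log g(S) + 1 \cdot \log\bigl(p(S) - p_0\bigr) \\
\log q^*(S) &= (a, 1, w) \cdot r(S)
  = a \,\log g(S) + \log\bigl(p(S) - p_0\bigr) + \sum_{j} w_j \, r_j(S)
\end{align*}
```

With $w = 0$ the ranking function reduces to the original quality $q$, so the learned weights $w$ act as a correction on top of standard subgroup discovery.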
Ranking Optimization Problem (2/2)
§ Solution similar to ranking SVM
§ Optimization problem (highlighted term: deviation from the parameter $a_0$ in subgroup discovery)
§ Equivalent problem, where $z = r(S_{i,1}) - r(S_{i,2})$
§ Remember: $\log q^*(S) = (a, 1, w) \cdot r(S)$
§ The constant weight for $g(S)$ defines the margin
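The exact optimization problem is not preserved in this transcript, so the sketch below shows only the generic idea behind a ranking SVM: each preference pair "S1 > S2" becomes a difference vector $z$, and a linear weight vector is trained so that its dot product with $z$ exceeds a margin. It is a plain regularized hinge-loss subgradient method, not the formulation from the talk: it has no fixed constant weight and no penalty on the deviation from $a_0$. The function names, hyperparameters, and feature vectors are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_pairwise_ranker(pairs, dim, epochs=200, lr=0.1, reg=0.01):
    """Learn weights v with v . (r_better - r_worse) >= 1 where possible.

    pairs: list of (r_better, r_worse) feature vectors of length dim.
    """
    v = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            z = [b - c for b, c in zip(better, worse)]  # difference vector
            margin = dot(v, z)
            # Subgradient step on reg/2 * |v|^2 + max(0, 1 - v . z)
            v = [vi * (1.0 - lr * reg) for vi in v]
            if margin < 1.0:
                v = [vi + lr * zi for vi, zi in zip(v, z)]
    return v

# Hypothetical feedback: the user prefers subgroups with a high second feature.
pairs = [
    ([0.2, 1.0], [0.9, 0.0]),
    ([0.5, 0.8], [0.4, 0.1]),
    ([0.1, 0.9], [0.8, 0.3]),
]
v = train_pairwise_ranker(pairs, dim=2)
margins = [dot(v, [b - c for b, c in zip(better, worse)])
           for better, worse in pairs]
```

On this separable toy input every entry of `margins` ends up positive, meaning the learned scores reproduce all three stated preferences.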
Iterative Procedure
[Diagram: subgroup search and ranking]
§ Why?
  • Google: ~$10^{12}$ web pages
  • Same number of possible subgroups on a 12-dimensional data set with 9 distinct values per attribute
  • Cannot compute all subgroups for single-step ranking
§ Approach
  • The optimization problem gives a new estimate of $a$
  • Transform the weights of the subgroup features into weights for the original examples
  • Idea: replace the binary $y$ with a numeric value. An appropriate offset guarantees that the subgroup quality $q$ approximates the optimized $q^*$
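The size comparison above can be checked with a short calculation. It assumes that a subgroup description either leaves an attribute unconstrained or fixes it to one of its values, which gives (values + 1) choices per attribute; the function name is hypothetical, and the count includes the empty description that covers all examples.

```python
def count_subgroup_descriptions(num_attributes, values_per_attribute):
    """Number of conjunctive descriptions over nominal attributes.

    Each attribute is either unused or fixed to one of its values.
    """
    return (values_per_attribute + 1) ** num_attributes

n_descriptions = count_subgroup_descriptions(12, 9)  # (9 + 1) ** 12
```

For 12 attributes with 9 values each this gives 10^12 descriptions, the same order of magnitude as the web-page count quoted on the slide.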
Experiments
§ Simulation on UCI data
  • Replace the true label with the most correlated attribute
  • Use the true label to simulate the user
  • Measure the correspondence of the algorithm's ranking with the subgroups found on the true label
  • Tests the ability of the approach to adapt flexibly to correlated patterns
§ Performance measures
  • Area under the curve (AUC): retrieval of the true top 100 subgroups
  • Kendall's τ: internal consistency of the returned ranking
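For readers unfamiliar with Kendall's τ, here is a minimal sketch of the simplest variant (often called tau-a), which compares two score lists over the same items by counting concordant and discordant pairs. The slides do not say which variant or implementation was used, so this is an illustration only; tied pairs count as neither concordant nor discordant, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both lists order the pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

tau_same = kendall_tau([1, 2, 3], [1, 2, 3])
tau_reversed = kendall_tau([1, 2, 3], [3, 2, 1])
tau_one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

Identical orderings give τ = 1 and fully reversed orderings give τ = -1; a single adjacent swap among four items leaves five of six pairs concordant.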
Results

Data set        AUC       Kendall's τ
Diabetes        0.256     0.008
Breast-w        0.759     0.120
Vote            0.664     0.051
Segment         0.596     0.601
Vehicle         0.053     0.500
Heart-c         0.180     0.036
Primary-tumor   0.739     0.532
Hypothyroid     0.729     0.307
Ionosphere      0.227     0.708
Credit-a        0.050     0.241
Credit-g        0.019     0.285
Colic           1.9E-4    0.213
Anneal          0.030     0.329
Soybean         1.9E-4    0.040
Mushroom        0.542     0.320
mean            0.323     0.286

§ A Wilcoxon signed rank test confirms significance
§ The 3 data sets with minimal AUC are exactly the ones with minimal correlation between true and proxy label!
Conclusions
§ Example of ranking on complex, knowledge-rich data
§ The interestingness of subgroup patterns can be significantly increased with an interactive, ranking-based method
§ Step toward automating machine learning for end users
§ Future work
  • Validation with real users
  • Active learning approach