Do We Really Sample Right In Model-Based Diagnosis?

Do We Really Sample Right In Model-Based Diagnosis? (DX'20) Patrick Rodler, Fatima Elichanova

Motivation
• assume an election poll:
  – ask only university professors for whom they will vote
  – will the result of the poll be representative of the entire population?
• a similar thing is often done in MBD:
  – task: find the actual diagnosis (= the actually faulty components) among a (large) set of diagnoses
  – computing all diagnoses is intractable → compute only a sample of diagnoses
  – use the sample to make estimations that guide diagnostic actions (system measurements)
  – draw best-first samples (most probable or min-cardinality diagnoses)
• but:
  – statistical law: "A randomly chosen unbiased sample from a population allows (on average) better conclusions and estimations about the whole population than any other sample."
  – questions of interest:
    • does this apply to MBD as well, or are best-first samples really more informative than random ones in MBD?
    • perhaps we could do better by using randomized algorithms to generate our diagnoses?
• contribution: extensive experiments to shed light on these questions

Example (Impact of Sample in MBD)
• most MP (measurement point) selection heuristics used in MBD use exactly these two properties (elim rate and prob of outcomes)!
[figure: samples S1, S2 and S3 drawn from the set of all diagnoses (unknown)]

Example (Impact of Sample in MBD)
• which MP is better wrt. information gain?
[figure: samples S1, S2 and S3 drawn from the set of all diagnoses (unknown)]

Evaluation Approach (Overview)
• comprehensive evaluations: 8 x 6 x 5 x 4 = 960 factor combinations (enumerated in the sketch below)
  – using
    • 8 real-world diagnosis problems
    • 6 different sample types (best-first, random, etc.)
    • 5 different sample sizes (2, 6, 10, 20, 50)
    • 4 different measurement selection heuristics (information gain, etc.)
  – to assess
    • theoretical representativeness: how good the estimations of MP properties (elim rate and prob of outcomes) are that the sample types produce
    • practical representativeness: how efficiently the sample types guide the finding of the actual diagnosis in sequential diagnosis
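
For concreteness, the experimental grid can be pictured as the Cartesian product of the four factors. The following minimal Python sketch enumerates the 960 combinations; the factor values are taken from this slide and the heuristic names from the Results slides, while the variable names are purely illustrative.

    from itertools import product

    # factor levels as listed on this slide
    problems = [f"P{i}" for i in range(1, 9)]               # 8 real-world diagnosis problems
    sample_types = ["bf", "rd", "wf", "abf", "ard", "awf"]  # 6 sample types
    sample_sizes = [2, 6, 10, 20, 50]                       # 5 sample sizes
    heuristics = ["ENT", "SPL", "RIO", "MPS"]               # 4 measurement selection heuristics

    grid = list(product(problems, sample_types, sample_sizes, heuristics))
    print(len(grid))  # 8 * 6 * 5 * 4 = 960 factor combinations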

Evaluation Approach (Details)
• dataset: real-world diagnosis problems (domain: KB/ontology debugging)
• table columns (per problem): # components |COMPS|, # min conflicts, # min diags, size of min-card diag, size of max-card min diag, size of min-card conflict, size of max-card min conflict, complexity of the consistency check (roughly: more letters, higher complexity; all beyond NP-complete)
• selection criterion: computation of all min diags possible within reasonable time for the experiments (single-digit # of minutes)
  – reason: to be able to draw unbiased random samples

Evaluation Approach (Details)
• sample types:
  – best-first (bf): most probable diagnoses
  – random (rd): unbiased random selection from the min diags
  – worst-first (wf): least probable diagnoses
  – approx best-first (abf), approx random (ard), approx worst-first (awf): heuristic approximations of the above
• bf, rd, wf are specific samples (baselines): the sampling outcome is precisely predefined; usually more expensive (exact techniques)
• abf, ard, awf are unspecific samples: the exact sampling outcome is not known; usually less expensive (heuristic / approximate techniques)
• computation of samples (cf. the sketch below):
  – bf: uniform-cost HS-Tree [Reiter, 1987]
  – rd: determine all min diags, sample randomly
  – wf: determine all min diags, select the least probable diags
  – abf: Inv-QX [Schekotihin et al, 2014] with sorting of COMPS by prob in descending order
  – ard: Inv-QX with random sorting of COMPS
  – awf: Inv-QX with sorting of COMPS by prob in ascending order
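
As a minimal sketch of the exact sample types, assuming all minimal diagnoses have already been computed and are given as (component set, probability) pairs: rd and wf are described on this slide exactly as a selection from that list, whereas bf is computed in the evaluation with a uniform-cost HS-Tree; the brute-force bf selection below and the function name draw_sample are only for illustration.

    import random

    def draw_sample(all_min_diags, k, mode, rng=random.Random(0)):
        """Draw a size-k sample from a precomputed list of minimal diagnoses.

        all_min_diags: list of (diagnosis, probability) pairs, where a diagnosis
        is a frozenset of component names. mode is one of "bf", "rd", "wf".
        """
        if mode == "rd":  # unbiased random selection from the min diags
            return rng.sample(all_min_diags, k)
        ranked = sorted(all_min_diags, key=lambda dp: dp[1], reverse=(mode == "bf"))
        return ranked[:k]  # bf: most probable diags, wf: least probable diags

    # toy usage with made-up diagnoses and probabilities
    diags = [(frozenset({"c1"}), 0.40), (frozenset({"c2", "c3"}), 0.25),
             (frozenset({"c4"}), 0.20), (frozenset({"c2", "c5"}), 0.15)]
    print(draw_sample(diags, 2, "bf"))  # two most probable diagnoses
    print(draw_sample(diags, 2, "rd"))  # two randomly chosen diagnoses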

Evaluation Approach (Details)
• accuracy, efficiency
[figure/diagram; only the keywords "accuracy" and "efficiency" are recoverable]

Evaluation Approach (Details)
• 8 x 6 x 5 x 50 = 12,000 pairs of tuples (estimated vs. real values) to be compared, for both prob and elim-rate (cf. the sketch below for how such estimates can be obtained from a sample)
• note: the "real" reference values are themselves approximations; computing them exactly is computationally intractable (it requires all min and non-min diagnoses)
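
The quantities being compared are sample-based estimates of an MP's outcome probability and elimination rates. The sketch below shows one plausible way to compute such estimates from a sample; predicts_true is an illustrative placeholder for the reasoner call that determines which outcome a sampled diagnosis entails, and the binary outcomes and clean two-way split are simplifying assumptions rather than the paper's exact definitions.

    def estimate_mp_properties(sample, predicts_true):
        """Estimate outcome probability and elimination rates for one MP.

        sample: list of (diagnosis, probability) pairs drawn by some sample type.
        predicts_true: callable telling which outcome a diagnosis entails for the
        candidate MP (diagnoses consistent with both outcomes are ignored here).
        """
        total = sum(p for _, p in sample)
        p_true = sum(p for d, p in sample if predicts_true(d)) / total
        n = len(sample)
        n_true = sum(1 for d, _ in sample if predicts_true(d))
        # outcome "true" eliminates the diagnoses predicting "false", and vice versa
        elim_if_true = (n - n_true) / n
        elim_if_false = n_true / n
        return p_true, elim_if_true, elim_if_false

    # toy usage: the MP outcome is assumed to depend on whether "c2" is faulty
    sample = [(frozenset({"c1"}), 0.4), (frozenset({"c2", "c3"}), 0.3), (frozenset({"c4"}), 0.3)]
    print(estimate_mp_properties(sample, lambda d: "c2" in d))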

Evaluation Approach (Details)
• 8 x 6 x 5 x 4 = 960 factor combinations, 9,600 sequential diagnosis sessions

Results (data for elim-rate estimation, E)
• better = higher Pearson correlation coefficient between estimated and real values (see the sketch below)
• sample type Ti being ranked prior to type Tj = Ti was better than Tj in more of the factor combinations of the respective scenario
• the ranking does not mean that a higher-ranked Ti is always better than Tj; the table gives no info about how much better (or worse) one Ti was than another Tj
• possible explanation: there is usually a large # of diags with low prob, i.e. the "sub-population" from which wf "selects" is large and thus representative
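
Since "better" is defined via the Pearson correlation between estimated and (approximated) real values, here is a small self-contained sketch of that comparison; the toy numbers are made up purely for illustration.

    from math import sqrt

    def pearson(estimated, real):
        """Pearson correlation coefficient between paired estimate/reference lists."""
        n = len(estimated)
        mean_e = sum(estimated) / n
        mean_r = sum(real) / n
        cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, real))
        var_e = sum((e - mean_e) ** 2 for e in estimated)
        var_r = sum((r - mean_r) ** 2 for r in real)
        return cov / sqrt(var_e * var_r)

    # toy usage: estimated vs. (approximated) real elimination rates for a few MPs
    print(pearson([0.50, 0.30, 0.70, 0.40], [0.55, 0.25, 0.65, 0.45]))  # ~0.94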

Results (data for prob estimation, P)
• better = higher Pearson correlation coefficient between estimated and real values
• sample type Ti being ranked prior to type Tj = Ti was better than Tj in more of the factor combinations of the respective scenario
• the ranking does not mean that a higher-ranked Ti is always better than Tj; the table gives no info about how much better (or worse) one Ti was than another Tj
• possible explanation: the few most probable diags often account for a major part of the overall prob mass; hence, better estimations

Results (data for # measurements in a diagnosis session, M)
• ENT = select the MP with the highest information gain (cf. the sketch below)
• SPL = select the MP with the lowest worst-case elim-rate
• RIO = dynamic learned combination of ENT + SPL
• MPS = select the MP with the max prob of the max elim rate
• surprising: SPL uses the elim rate only (no probs), and rd estimates the elim rate better than bf; further analysis required
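
To make ENT concrete: under a common one-step simplification with binary outcomes, the MP with the highest expected information gain is the one whose estimated outcome distribution has maximal entropy (outcome probability closest to 0.5). The sketch below scores candidate MPs this way; candidate_mps and predicts_true_for are illustrative placeholders, not the paper's implementation.

    from math import log2

    def outcome_entropy(p_true):
        """Entropy (in bits) of a binary outcome distribution; maximal at p_true = 0.5."""
        if p_true in (0.0, 1.0):
            return 0.0
        return -(p_true * log2(p_true) + (1 - p_true) * log2(1 - p_true))

    def select_mp_ent(candidate_mps, sample, predicts_true_for):
        """Pick the MP whose estimated outcome distribution has maximal entropy,
        i.e. (under this one-step simplification) the highest expected info gain.
        sample: list of (diagnosis, probability) pairs."""
        total = sum(p for _, p in sample)

        def score(mp):
            predicts_true = predicts_true_for(mp)  # outcome each diagnosis entails for mp
            p_true = sum(p for d, p in sample if predicts_true(d)) / total
            return outcome_entropy(p_true)

        return max(candidate_mps, key=score)

    # toy usage: "m1" splits the sample 0.5/0.5, "m2" splits it 0.9/0.1
    sample = [(frozenset({"c1"}), 0.5), (frozenset({"c2"}), 0.4), (frozenset({"c3"}), 0.1)]
    tests = {"m1": lambda d: "c1" in d, "m2": lambda d: "c3" not in d}
    print(select_mp_ent(["m1", "m2"], sample, lambda mp: tests[mp]))  # -> "m1"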

Results (data for computation time in a diagnosis session, T)
• ENT, SPL, RIO, MPS as defined above
• reason: "brute force" sampling = precompute all min diags, then select

Results
• RQ 2: Which type of sample is best in terms of practical representativeness (criterion M+T, i.e. the overall time for a diagnosis session = computation time + measurement time)?
  – for small sample size (< 10): bf
  – for sample size 20: abf; awf (if the time for measurements is low); bf (else)
  – for large sample size (50): rd (if an efficient method for rd is available); bf (else)
  – for {ENT, SPL}: bf
  – for RIO: ard (if the time for measurements is large); rd (else, if an efficient method for rd is available); bf (else)
  – for MPS: ard
• the scenarios where bf comes out best (small sample sizes, {ENT, SPL}) appear to be the most common scenarios in MBD practice
• the evaluation thus suggests: how we sample in MBD is in fact plausible
• ENT, SPL, RIO, MPS as defined above

Results
• RQ 3: Are the results wrt. RQ 1 and RQ 2 consistent over different (a) sample sizes, (b) measurement selection heuristics, and (c) diagnosis problem instances?
  – RQ 1: fairly consistent rankings for (a), less so for (c); results stable wrt. the winning strategy (recall: (b) is not applicable for RQ 1, since no heuristics are involved in EXP 1)
  – RQ 2: more fluctuation here for (a) and (b); for (c): rankings for T decidedly more stable than for M, i.e. the relative sampling time is less affected by different problem instances than the relative informativeness of samples
• RQ 4: Does a larger sample size (more computed diagnoses) imply better representativeness?
  – theoretical representativeness: yes, in line with earlier studies [de Kleer, 1995; Rodler, Schmid, 2018]
  – practical representativeness: no (obvious for T, less so for M)
• RQ 5: Does a better theoretical representativeness translate to a better practical representativeness?
  – from our results this cannot be generally concluded
  – possible explanations:
    1. the heuristics are based on a lookahead of only one step (the approximate character of this analysis might counteract the benefit of good estimations)
    2. the added (information) value of additional diagnoses taken into a sample, regardless of how they are selected, decreases with the sample size (cf. the law of diminishing marginal utility)

Conclusions
• goal: assess the impact of different sample types on diagnostic decision making and efficiency, with a special focus on
  – the statistical unfoundedness of best-first samples (commonly used in MBD)
  – the theoretical attractiveness of random samples (not commonly used in MBD)
• extensive experiments (8 diagnosis problems, 6 sample types, 5 sample sizes, 4 heuristics)
• bottom line:
  – random samples: very good estimations, but only (most) efficient for large samples + one particular heuristic; this presupposes an efficient random diagnosis sampling algorithm, which is an open problem
  – for small sample size + the most commonly used heuristics (e.g. info gain), best-first samples are best
  – for medium sample size, samples generated by Inv-QX [Schekotihin et al, 2014] are favorable
  – larger samples: better estimations, but no higher diagnostic efficiency
  – better estimations do not (generally) translate to better diagnostic efficiency
• limitations: statistical significance not evaluated / "best-first" = min-card variant omitted / less common heuristics omitted / further non-evaluated sample types exist / only binary-outcome measurements