PAKDD Panel: What Next Ramakrishnan Srikant

What Next
• Electronic Commerce
  – Catalog Integration (WWW 2001, with R. Agrawal)
  – Searching with Numbers (WWW 2002, with R. Agrawal)
• Security
• Privacy

Catalog Integration
• B2B electronics portal: 2000 categories, 200K datasheets
[Figure labels: Master Catalog, New Catalog]

Intuition
• Use affinity information in new catalog.
  – Products in same category are similar.
• Bias Naïve Bayes classifier to incorporate this information.
  – Accuracy boost depends on match between two categorizations.
  – Use tuning set to determine weight given to affinity information.
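One way to picture the biasing step is as a reweighting of class priors. The sketch below assumes the prior for a master category C, given a new-catalog category S, is proportional to |C| · n(S, C)^w, where n(S, C) is the number of products from S that a plain classifier places in C and w is the weight picked on the tuning set. The function name, the +1 smoothing, and the toy data are hypothetical and not taken from the slides.

```python
from collections import Counter

def biased_priors(master_sizes, predicted_for_source, w):
    """Return P(master category | source category) under affinity weight w.

    master_sizes: dict of master category -> number of products in it
    predicted_for_source: master categories that a plain classifier predicted
        for the products of ONE new-catalog category
    w: affinity weight; w = 0 gives size-proportional priors (no bias)
    """
    counts = Counter(predicted_for_source)
    # +1 smoothing (an assumption) keeps unseen categories from getting zero mass
    raw = {c: size * (counts[c] + 1) ** w for c, size in master_sizes.items()}
    total = sum(raw.values())
    return {c: value / total for c, value in raw.items()}

# Toy data: 9 of 10 products in one new-catalog category look like laptops.
master_sizes = {"laptops": 100, "desktops": 100}
predicted = ["laptops"] * 9 + ["desktops"]
no_bias = biased_priors(master_sizes, predicted, w=0)
biased = biased_priors(master_sizes, predicted, w=2)
```

With w = 0 both master categories keep equal priors; raising w shifts the prior toward the category that most of the source category's products already fall into. Choosing w would mean trying several values and keeping the one with the best accuracy on the tuning set.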

Yahoo & Google
• 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
  – Typical match: 69%, 15%, 3%, 1%, …
• Merging Yahoo into Google
  – 30% fewer errors (14.1% absolute difference in accuracy)
• Merging Google into Yahoo
  – 26% fewer errors (14.3% absolute difference)
• Open Problems: SVM, Decision Tree, ...
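The two figures on each merge line are related by simple arithmetic. The sketch below assumes that "fewer errors" means a relative reduction in error rate and that the absolute difference is measured in accuracy percentage points. The function name is hypothetical, and the accuracies it returns are only implied by the rounded numbers above; they are not stated on the slides.

```python
def implied_rates(relative_error_reduction, absolute_gain):
    """Back out (baseline accuracy, improved accuracy) in percentage points.

    relative_error_reduction: e.g. 0.30 for "30% fewer errors"
    absolute_gain: accuracy gain in percentage points, e.g. 14.1
    """
    # absolute_gain = relative_error_reduction * baseline_error
    baseline_error = absolute_gain / relative_error_reduction
    improved_error = baseline_error - absolute_gain
    return 100 - baseline_error, 100 - improved_error

# Inputs are the rounded figures from the slide, so results are approximate.
yahoo_into_google = implied_rates(0.30, 14.1)  # roughly 53% -> 67% accuracy
google_into_yahoo = implied_rates(0.26, 14.3)  # roughly 45% -> 59% accuracy
```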

Data Extraction is hard
• Synonyms for attribute names and units.
  – "lb" and "pounds", but no "lbs" or "pound".
• Attribute names are often missing.
  – No "Speed", just "MHz Pentium III"
  – No "Memory", just "MB SDRAM"
Sample datasheet:
• 850 MHz Intel Pentium III
• 192 MB RAM
• 15 GB Hard Disk
• DVD Recorder: Included;
• Windows Me
• 14.1 inch display
• 8.0 pounds

Searching with Numbers

Why does it work?
• Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names.
• Non-overlapping attributes:
  – Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
• Correlations:
  – Memory: 64 - 512 Mb, Disk: 10 - 100 Gb still fine.
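The conjecture can be illustrated with a small sketch that ranks documents purely by how closely their numbers match the query numbers, ignoring attribute names. The relative-distance measure, the brute-force pairing, and the toy documents are assumptions made for illustration; they are not necessarily the scoring used in the WWW 2002 work.

```python
from itertools import permutations

def relative_distance(q, d):
    """Distance between two positive numbers, scaled by the query value."""
    return abs(q - d) / abs(q)

def match_score(query_numbers, doc_numbers):
    """Smallest total distance over all ways of pairing each query number
    with a distinct document number. Attribute names are never consulted.
    Brute force over permutations, so only suitable for a handful of numbers."""
    if len(doc_numbers) < len(query_numbers):
        return float("inf")
    return min(
        sum(relative_distance(q, d) for q, d in zip(query_numbers, perm))
        for perm in permutations(doc_numbers, len(query_numbers))
    )

# Query: 256 (memory) and 20 (disk), given without attribute names.
query = [256, 20]
docs = {
    "A": [20, 256],   # same values, listed in a different order
    "B": [128, 30],
    "C": [512, 10],
}
ranked = sorted(docs, key=lambda name: match_score(query, docs[name]))
```

Because the memory and disk ranges do not overlap, the cheapest pairing always lines up 256 with the memory value and 20 with the disk value, which is the sense in which a close numeric match implies a correct attribute match.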

Empirical Results

Incorporating Hints
• Use simple data extraction techniques to get hints.
• Names/Units in query matched against Hints.
• Open Problem: Rethink data extraction in this context.
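A rough sketch of how hints might enter a number-matching score: each number carries an optional unit hint, and a pairing is penalized only when both sides have a hint and the hints disagree. The penalty constant, the function names, and the data are hypothetical; the slides do not specify how hints are weighted, and the hints here are written by hand rather than extracted.

```python
from itertools import permutations

MISMATCH_PENALTY = 1.0  # hypothetical constant, not from the slides

def pair_cost(q, d):
    """q and d are (value, unit_hint) pairs; unit_hint may be None."""
    (q_val, q_unit), (d_val, d_unit) = q, d
    cost = abs(q_val - d_val) / abs(q_val)
    # A hint only counts when both sides carry one and they disagree
    if q_unit is not None and d_unit is not None and q_unit != d_unit:
        cost += MISMATCH_PENALTY
    return cost

def hinted_score(query, doc):
    """Smallest total cost over all pairings of query items with doc items."""
    if len(doc) < len(query):
        return float("inf")
    return min(
        sum(pair_cost(q, d) for q, d in zip(query, perm))
        for perm in permutations(doc, len(query))
    )

query = [(256, "MB")]
doc_memory = [(256, "MB"), (20, "GB")]   # 256 really is a memory size
doc_other = [(256, "MHz"), (4, None)]    # 256 is a clock speed here
```

Without hints both documents would match the number 256 exactly; with hints, only `doc_memory` keeps a score of zero.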

Security

Some Hard Problems
• Past may be a poor predictor of future
  – Abrupt changes
• Reliability and quality of data
  – Wrong training examples
• Simultaneous mining over multiple data types
• Richer patterns

Privacy Preserving Data Mining
• Have your cake and mine it too!
  – Preserve privacy at the individual level, but still build accurate models.
• Challenges
  – Privacy Breaches
  – Clustering & Associations
  – Privacy-sensitive Security Applications
• Opportunities
  – Web Demographics
  – Inter-Enterprise Data Mining
  – Privacy-sensitive Security Applications