Data mining from large unstructured data sources
Tim Menzies, Roy Nutter, WVU
July 15, 2008

Handling diverse data
  • The bad news
    – Different organizations, different data
    – Cannot demand detailed data storage uniformity
    – 80% of the electronic data in modern organizations is unstructured:
      .doc files, quirky spreadsheets, email, etc.
  • The good news
    – It (sometimes) doesn’t matter
    – Can still extract sense from soup

Example 1: learning what we don’t know
  • Problem: don’t even know the structures that are out there
  • Solution:
    – dimensionality discovery
    – dimensionality reduction
    – clustering
From 9 dimensions to 2
And we’ll do better, later
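The slide names dimensionality reduction and clustering but not the specific algorithms used. As a rough sketch only, the following assumes a crude reduction (keep the columns with the largest variance) followed by plain k-means. The function names, the toy rows, and the choice of both techniques are illustrative assumptions, not the authors' method.

```python
from statistics import pvariance


def keep_top_variance(rows, n_keep):
    """Keep the n_keep columns with the largest variance (a crude reduction)."""
    columns = list(zip(*rows))
    ranked = sorted(range(len(columns)),
                    key=lambda i: (-pvariance(columns[i]), i))
    kept = sorted(ranked[:n_keep])
    return kept, [[row[i] for i in kept] for row in rows]


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy data: 4 columns, of which only two vary much.
rows = [
    [1.0, 5.0, 0.1, 10.0],
    [1.0, 5.1, 0.2, 10.5],
    [1.0, 9.0, 0.1, 20.0],
    [1.0, 9.2, 0.2, 20.5],
]
kept, reduced = keep_top_variance(rows, n_keep=2)
clusters = kmeans(reduced, k=2)
```

Here the constant and near-constant columns are dropped, and the two remaining columns are enough to separate the rows into two groups. A projection method such as principal component analysis would be a more usual choice on real data.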

Example 2: From nonsense to sense
  • Cluster
    – Inspect
  • Group
    – Classify
A decision tree predicting for each class
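One way to read the cluster, inspect, group, classify sequence: cluster ids are given human-readable names after inspection, and a tree is then learned to predict those names. The sketch below assumes the clustering has already been done and learns only a one-level decision tree (a stump), which is a simplification of the full tree the slide shows. The data, the group names, and the helper functions are hypothetical.

```python
def learn_stump(rows, labels):
    """Return (feature, threshold, left_label, right_label) with fewest errors."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds sit halfway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            # Majority label on each side; sorted() keeps ties deterministic.
            left_label = max(sorted(set(left)), key=left.count)
            right_label = max(sorted(set(right)), key=right.count)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def predict(stump, row):
    """Apply the learned one-level tree to a single row."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


# Hypothetical output of an earlier clustering step, plus names chosen
# by a person after inspecting each cluster.
rows = [[5.0, 10.0], [5.1, 10.5], [9.0, 20.0], [9.2, 20.5]]
cluster_ids = [0, 0, 1, 1]
group_names = {0: "nominal", 1: "anomalous"}
labels = [group_names[c] for c in cluster_ids]

stump = learn_stump(rows, labels)
```

Calling `predict(stump, [4.8, 9.0])` on a new row then returns one of the group names without re-running the clustering.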

The good news: most collectable data is ignorable
  • Allows us to scale to very large problems

Real-time application
  • Data intensive application
  • Real-time monitoring of 100 variables, 1000 runs of a simulator
Step one: data collection
Step two: massive dimensionality reduction
Step three: structure extraction
From 100 dimensions to 2

Works, even for unstructured data (part 1 of 2)
  • 10,000s of words
  • 1,000s of uniques
  • 100 most interesting
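The slides do not say how "most interesting" is computed beyond the later mention of information content (entropy). A minimal sketch, assuming labelled documents and assuming that "interesting" means high information gain (the drop in class entropy when documents are split on whether they contain a word). The documents, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(word, token_sets, labels):
    """Class entropy removed by splitting documents on presence of `word`."""
    with_word = [y for d, y in zip(token_sets, labels) if word in d]
    without = [y for d, y in zip(token_sets, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder


def top_words(docs, labels, k):
    """Return the k unique words with the highest information gain."""
    token_sets = [set(text.lower().split()) for text in docs]
    vocabulary = set().union(*token_sets)
    # Sort by descending gain; break ties alphabetically for stable output.
    ranked = sorted(vocabulary,
                    key=lambda w: (-information_gain(w, token_sets, labels), w))
    return ranked[:k]


# Hypothetical labelled documents.
docs = ["engine fault detected", "engine nominal",
        "fault in valve", "valve nominal"]
labels = ["bad", "ok", "bad", "ok"]
best = top_words(docs, labels, k=2)
```

On this toy input, "fault" and "nominal" each split the classes perfectly, so they outrank words such as "engine" that appear equally in both classes.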

Works, even for unstructured data (part 2 of 2)
  • Top 100 words sorted by information content (entropy)
  • Rules learned from top 3 words
  • F = 2*prec*recall / (prec+recall)
[Figure labels: Top 3 terms; Top 100 terms]
From 10,000 dimensions to 3
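The F formula above combines precision and recall for one target class. A small sketch of that calculation follows; the function name, the zero-division handling, and the example predictions are assumptions added for illustration.

```python
def f_measure(predicted, actual, target):
    """F = 2*prec*recall / (prec+recall) for a single target class."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == target and a == target)
    false_pos = sum(1 for p, a in pairs if p == target and a != target)
    false_neg = sum(1 for p, a in pairs if p != target and a == target)
    # Define precision and recall as 0 when their denominators are empty.
    prec = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if prec + recall == 0:
        return 0.0
    return 2 * prec * recall / (prec + recall)


# Hypothetical predictions from a rule, compared against the true labels.
predicted = ["bad", "bad", "ok", "ok"]
actual = ["bad", "ok", "bad", "ok"]
score = f_measure(predicted, actual, target="bad")
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so the score is 0.5.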

Conclusion
  • Even large unstructured data sources can be automatically analyzed using data mining.
  • Applications to?