Data mining from large unstructured data sources
Tim Menzies, Roy Nutter, WVU
July 15, 2008

Handling diverse data
• The bad news
  – Different organizations, different data
    • Cannot demand detailed data storage uniformity
  – 80% of the electronic data in modern organizations is unstructured
    • .doc files
    • Quirky spreadsheets
    • Email
    • Etc.
• The good news
  – It (sometimes) doesn’t matter
  – Can still extract sense from soup

Example 1: learning what we don’t know
• Problem: don’t even know the structures that are out there
• Solution:
  – dimensionality discovery
  – dimensionality reduction
  – clustering
[Figure: from 9 dimensions to 2. Callout: "And we’ll do better, later"]
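
The slide does not say which reduction method was used, so the following is only a minimal sketch of one common choice, principal component analysis computed by power iteration, under the assumption that each record is a row of numbers (9 per row in the slide's example). The function names (`center`, `principal_axes`, `project`) are hypothetical and chosen for illustration.

```python
def center(rows):
    """Subtract each column's mean so the data is centered on the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_eigenvector(matrix, iterations=500):
    """Power iteration: multiply and renormalize until the vector settles."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # fixed start vector
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:  # nothing left to find (matrix is all zeros)
            break
        vec = [x / norm for x in nxt]
    return vec


def principal_axes(rows, count=2):
    """Return the `count` directions of largest variance (assumed method: PCA)."""
    cov = covariance(center(rows))
    size = len(cov)
    axes = []
    for _ in range(count):
        axis = top_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(axis, mat_vec(cov, axis)))
        # Deflation: remove the variance along this axis before the next search.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(size)]
               for i in range(size)]
        axes.append(axis)
    return axes


def project(rows, axes):
    """Map every row to its coordinates along the given axes."""
    return [tuple(sum(a * b for a, b in zip(r, axis)) for axis in axes)
            for r in center(rows)]
```

Calling `project(rows, principal_axes(rows))` on 9-column rows yields one 2-D point per row, which can then be plotted or clustered. Other reduction methods would fit the slide equally well; PCA is used here only because it is compact to write down.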

Example 2: From nonsense to sense
• Cluster
  – Inspect
• Group
  – Classify
[Figure: a decision tree predicting for each class]
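
The cluster-then-classify sequence above can be sketched as follows. This is an illustration under stated assumptions, not the method behind the slide: k-means stands in for the unnamed clustering step, and a one-level decision tree (a "stump") stands in for the decision tree in the figure. The names `kmeans`, `best_stump`, and `classify` are hypothetical.

```python
from collections import Counter


def kmeans(points, k, iterations=20):
    """Assign each point to one of k clusters (naive init: the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def best_stump(points, labels):
    """Find the single threshold test that misclassifies the fewest points.

    Returns (feature_index, threshold, label_if_below_or_equal, label_if_above).
    """
    best = None
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = Counter(lab for p, lab in zip(points, labels) if p[f] <= t)
            right = Counter(lab for p, lab in zip(points, labels) if p[f] > t)
            left_label, left_hits = left.most_common(1)[0]
            right_label, right_hits = right.most_common(1)[0]
            errors = len(points) - left_hits - right_hits
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def classify(stump, point):
    """Apply a learned stump to a new point."""
    feature, threshold, below, above = stump
    return below if point[feature] <= threshold else above
```

A typical use would be `labels = kmeans(points, 2)`, an inspection of the resulting groups, then `stump = best_stump(points, labels)` to get a rule that explains the groups and labels new points via `classify(stump, new_point)`.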

The good news: most collectable data is ignorable
• Allows us to scale to very large problems

Real-time application
• Data intensive application
• Real-time monitoring of 100 variables, 1000 runs of a simulator
Step one: data collection
Step two: massive dimensionality reduction
Step three: structure extraction
[Figure: from 100 dimensions to 2]

Works, even for unstructured data (part 1 of 2)
• 10,000s of words
• 1,000s of uniques
• 100 most interesting
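
The funnel above (all words, then unique words, then a short list of the most interesting) can be sketched with the standard library. The tokenizer and the function name `vocabulary_funnel` are assumptions made for illustration; the default ranking by raw frequency is only a placeholder for whatever "interesting" measure is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vocabulary_funnel(documents, keep=100, score=None):
    """Return (total word count, unique word count, top `keep` words).

    `score` maps a word to a number; higher means more interesting.
    When omitted, raw frequency is used as a stand-in.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    if score is None:
        score = counts.get
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(counts, key=lambda w: (-score(w), w))
    return sum(counts.values()), len(counts), ranked[:keep]
```

For example, `vocabulary_funnel(["the cat sat on the mat", "The dog sat"], keep=2)` counts 9 words and 6 uniques and keeps `the` and `sat`.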

Works, even for unstructured data (part 2 of 2)
• Top 100 words sorted by information content (entropy)
• Rules learned from top 3 words
• F = 2*prec*recall / (prec + recall)
[Figure: top 3 terms vs. top 100 terms. From 10,000 dimensions to 3]
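
The entropy ranking and the F formula above can be made concrete as follows. The slide does not define "information content" precisely, so this sketch assumes information gain: how much knowing whether a word is present reduces the entropy of the class labels. The function names and the representation of each document as a set of words are likewise assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(word, documents, labels):
    """Entropy drop from splitting documents on presence of `word`.

    `documents` is a list of sets of words, aligned with `labels`.
    """
    present = [lab for doc, lab in zip(documents, labels) if word in doc]
    absent = [lab for doc, lab in zip(documents, labels) if word not in doc]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder


def rank_words(documents, labels, keep=3):
    """Return the `keep` words with the highest information gain."""
    vocabulary = set().union(*documents)
    return sorted(vocabulary,
                  key=lambda w: (-information_gain(w, documents, labels), w))[:keep]


def f_measure(predicted, actual, target):
    """F = 2*prec*recall / (prec + recall) for one target class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == target and a == target for p, a in pairs)
    fp = sum(p == target and a != target for p, a in pairs)
    fn = sum(p != target and a == target for p, a in pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * recall / (prec + recall) if prec + recall else 0.0
```

With such a ranking, a rule learner only needs to look at the few highest-ranked words, and `f_measure` gives a single score for comparing rules built from 3 terms against rules built from 100.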

Conclusion
• Even large unstructured data sources can be automatically analyzed using data mining.
• Applications to?