Data mining from large unstructured data sources Tim
- Slides: 9
Data mining from large unstructured data sources
Tim Menzies, Roy Nutter, WVU
July 15, 2008
Handling diverse data
- The bad news
  - Different organizations, different data
    - Cannot demand detailed data storage uniformity
  - 80% of the electronic data in modern organizations is unstructured
    - .doc files
    - Quirky spreadsheets
    - Email
    - Etc.
- The good news
  - It (sometimes) doesn’t matter
  - Can still extract sense from soup
Example 1: learning what we don’t know
- Problem: we don’t even know the structures that are out there
- Solution:
  - dimensionality discovery
  - dimensionality reduction
  - clustering
[Figure: from 9 dimensions to 2. Annotation: "And we’ll do better, later"]
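The slide does not say which reduction method produced the 9-to-2 projection. As a minimal sketch, assuming principal component analysis (PCA) is an acceptable stand-in, the following reduces synthetic 9-dimensional rows to 2 dimensions using only the standard library. The function names, the power-iteration approach, and the toy data are illustrative assumptions, not the authors' method.

```python
import random

def mean_center(rows):
    """Subtract each column's mean so every dimension is centered on zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: returns (eigenvalue, unit eigenvector) of the largest eigenvalue."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v

def pca_project(rows, k=2):
    """Project rows onto their first k principal components (assumed method)."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        value, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the component just found so the next one can surface.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Toy data (assumption): 9 observed dimensions driven by 2 hidden factors plus noise.
rng = random.Random(1)
rows = []
for _ in range(50):
    a, b = rng.gauss(0, 3), rng.gauss(0, 1)
    rows.append([a * (j + 1) / 9 + b * (9 - j) / 9 + rng.gauss(0, 0.05)
                 for j in range(9)])

projected = pca_project(rows, k=2)  # 50 rows, each now 2-dimensional
```

The 2-dimensional rows can then be plotted or passed to a clustering step; power iteration is chosen here only to keep the sketch free of third-party dependencies.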
Example 2: From nonsense to sense
- Cluster
  - Inspect
- Group
  - Classify
[Figure: a decision tree predicting for each class]
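The cluster-then-classify idea above can be sketched as follows: group unlabeled points, treat the cluster ids as classes, then learn a small tree that predicts the class. This is a sketch under stated assumptions: k-means stands in for the unnamed clustering method, a one-level decision tree (a "stump") stands in for the full decision tree, and the data and all function names are hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (assumed clustering method); returns one cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((p[j] - centers[c][j]) ** 2
                                        for j in range(len(p))))
                  for p in points]
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def learn_stump(points, labels):
    """One-level decision tree: the (feature, threshold) split with fewest errors."""
    best = None
    for feature in range(len(points[0])):
        values = sorted(set(p[feature] for p in points))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for p, lab in zip(points, labels) if p[feature] <= threshold]
            right = [lab for p, lab in zip(points, labels) if p[feature] > threshold]
            left_class = max(set(left), key=left.count)
            right_class = max(set(right), key=right.count)
            errors = (sum(lab != left_class for lab in left)
                      + sum(lab != right_class for lab in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_class, right_class)
    return best[1:]

# Toy data (assumption): two well-separated 2-D blobs with no labels attached.
rng = random.Random(42)
points = ([[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
          + [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(30)])

labels = kmeans(points, k=2)                       # step 1: cluster
feature, threshold, left_class, right_class = learn_stump(points, labels)  # step 2: classify

def predict(point):
    """Apply the learned one-split rule to a new point."""
    return left_class if point[feature] <= threshold else right_class
```

The learned split is readable (one feature, one threshold), which is the point of the "inspect" step: the rule describes what separates the groups that clustering found.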
The good news: most collectable data is ignorable
- Allows us to scale to very large problems
Real-time application
- Data-intensive application
- Real-time monitoring of 100 variables, 1000 runs of a simulator
- Step one: data collection
- Step two: massive dimensionality reduction (from 100 dimensions to 2)
- Step three: structure extraction
Works, even for unstructured data (part 1 of 2)
- 10,000s of words
- 1,000s of unique words
- 100 most interesting
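The funnel above (all words, then unique words, then a short list of the most interesting) can be sketched as below. Raw frequency is used here only as a placeholder score for "interesting", which is an assumption; the tokenizer and the function names are also illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def vocabulary_summary(documents, top_n=100):
    """Return (total words, unique words, top_n words by the placeholder score)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total_words = sum(counts.values())
    unique_words = len(counts)
    # Placeholder ranking (assumption): most frequent first.
    top_words = [word for word, _ in counts.most_common(top_n)]
    return total_words, unique_words, top_words

# Toy corpus (assumption) standing in for free-text reports.
docs = ["the cat sat on the mat", "the dog sat"]
total_words, unique_words, top_words = vocabulary_summary(docs, top_n=2)
```

In practice a stop-word filter or an information-based score would replace raw frequency, since the most frequent words are rarely the most informative.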
Works, even for unstructured data (part 2 of 2)
- Top 100 words sorted by information content (entropy)
- Rules learned from top 3 words
- F = 2 * prec * recall / (prec + recall)
[Figure: results using the top 3 terms vs. the top 100 terms; from 10,000 dimensions to 3]
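A minimal sketch of the two calculations named above: ranking words by an entropy-based score and scoring a rule with the F-measure. Information gain is assumed as the concrete "information content" measure, since the slide does not define it; the documents, labels, and the one-word rule are toy assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(word, documents, labels):
    """Drop in class entropy after splitting documents on presence of `word`."""
    present = [lab for doc, lab in zip(documents, labels) if word in doc]
    absent = [lab for doc, lab in zip(documents, labels) if word not in doc]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def f_measure(predicted, actual, positive):
    """F = 2 * prec * recall / (prec + recall) for the chosen positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data (assumption): each document is a set of words with a known class.
documents = [{"crash", "engine"}, {"crash", "fuel"},
             {"ok", "engine"}, {"ok", "fuel"}]
labels = ["fault", "fault", "normal", "normal"]

vocabulary = sorted(set().union(*documents))
ranked = sorted(vocabulary,
                key=lambda w: information_gain(w, documents, labels),
                reverse=True)

# Hypothetical one-word rule built from the top-ranked word.
best_word = ranked[0]
predicted = ["fault" if best_word in doc else "normal" for doc in documents]
score = f_measure(predicted, labels, positive="fault")
```

Keeping only the few highest-ranked words and learning rules over them mirrors the reduction from thousands of word dimensions to three described on the slide.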
Conclusion
- Even large unstructured data sources can be automatically analyzed using data mining.
- Applications to?