Data Mining in Ubiquitous Distributed Environments Assaf Schuster
Data Mining in Ubiquitous Distributed Environments Assaf Schuster Technion SEBD Tutorial, June 06
Purpose of this Tutorial
• Convergence of distributed systems and data mining
• Evolving field, no systematic coverage of all aspects
• Will present: issues, challenges, examples for algorithmic approaches, ideas, tradeoffs (accuracy vs. overhead)
• Will not present: formal treatment, proofs, details, technology, systems, hardware…
Ubiquitous Computing Systems
• Various Systems: Grid, P2P, WSN, MANET
• Several similar technological aspects
  – Scale, aim for at least 10K (10M in P2P)
    • partial failure, heterogeneity, dynamic state / data
  – Multi-user, a 10K system serves >= 1K users
    • resource sharing, caching, consistency
  – Lots of distributed data
    • streams, incremental, anytime, local filtering, locality filtering
  – Cooperation of self-motivated parties
    • trust management, security, privacy, competitive market, self vs. global optimizations
  – Stringently resource limited
    • in-network computing, storage distribution
• Non-similar technological aspects
Ubiquitous Data Mining
• For the community
  – E.g., P2P recommendations based on e-interaction
• For Security
  – E.g., identify and avert DoS attacks (Overpeer and P2P poisoning)
• For Administration
  – E.g., misconfiguration detection system (DataMiningGrid demo)
• For Data Cleansing
  – E.g., in-network outlier detection (and removal) in WSN
• DM Using HPC
  – E.g., idle-cycle batch systems for high-complexity analysis tasks (Superlink-Online)
Technological Challenges: Algorithms
• Scalable and resource limited distributed DM
  – Algorithms for 10K peers, algorithms limited to two messages per hour, synchronization-less, iteration-less, bag-of-tasks, dynamic divisibility, etc.
• Monitoring
  – Distributed, local filtering
• Success, Correctness, and Consistency
  – Partial failure, message dropping, heterogeneity, etc. can yield all sorts of trouble
• Reusability, incrementality
  – E.g., multi-classifiers, multi-metric k-means clustering, etc.
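The "local filtering" idea under Monitoring can be illustrated with a small sketch. This is not an algorithm from the tutorial. It assumes a simple setup in which each peer reports a reading to a coordinator only when it drifts more than a tolerance from the value it last reported. The names `FilteringPeer`, `slack`, and `simulate` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FilteringPeer:
    """A peer that suppresses updates that stay close to its last report."""
    slack: float                           # assumed tolerance (hypothetical parameter)
    last_reported: Optional[float] = None  # nothing has been reported yet
    messages_sent: int = 0

    def observe(self, value: float) -> Optional[float]:
        """Return the value to send, or None if it is filtered locally."""
        if self.last_reported is None or abs(value - self.last_reported) > self.slack:
            self.last_reported = value
            self.messages_sent += 1
            return value
        return None


def simulate(readings_per_peer: List[List[float]], slack: float) -> Tuple[List[float], int]:
    """Feed each peer its readings; return the coordinator's view and message count."""
    peers = [FilteringPeer(slack) for _ in readings_per_peer]
    estimate = [0.0] * len(peers)          # coordinator's last known value per peer
    for step in zip(*readings_per_peer):   # one reading per peer per time step
        for i, value in enumerate(step):
            update = peers[i].observe(value)
            if update is not None:
                estimate[i] = update
    return estimate, sum(p.messages_sent for p in peers)


if __name__ == "__main__":
    readings = [[1.0, 1.1, 1.2, 3.0], [5.0, 5.0, 5.1, 5.2]]
    estimate, sent = simulate(readings, slack=0.5)
    print(estimate, sent)
```

In this toy run the two peers produce eight readings but send only three messages, and the coordinator's view of each peer stays within `slack` of that peer's latest reading. A larger `slack` lowers communication at the cost of a coarser view, which is one instance of the accuracy vs. overhead tradeoff.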
Technological Challenges: Systems
• Exploitation & HCI
  – Lay user (parameterless) DM, interactive DM
  – DM-based autonomous ubiquitous systems
• Security, Fraud, and Privacy
  – Authorization, public-key infrastructure, trust management, data pollution
• Longevity of DM jobs
  – Resource sharing, non-dedicated resources
• Communication patterns
  – Esp. reliability and addressability. Are these problems best solved by suitable algorithms?