PIER & PHI
Overview of Challenges & Opportunities
Ryan Huebsch†, Joe Hellerstein†°, Boon Thau Loo†, Sam Mardanbeigi†, Scott Shenker†‡, Ion Stoica†
p2p@db.cs.berkeley.edu
†UC Berkeley, CS Division
‡International Computer Science Institute, Berkeley CA
°Intel Research Berkeley
STREAM DAY 5/7/04

PIER
- P2P Information Exchange & Retrieval
  - A wide-area distributed dataflow engine
  - Outfitted with relational operators
  - Designed to scale to thousands or millions of nodes
- Motivation:
  - It’s an interesting challenge
  - Lowers the barrier of entry for large-scale applications
    - No massive infrastructure for server farms
    - Cost is distributed among participants
  - Provide a viable solution where other options are not socially acceptable
We are NOT trying to be better than other (centralized) solutions; we are trying to be different.

Challenges
- General Challenges: Security, Privacy, Quality of Service
- Declarative Queries / Query Plan: Query Optimization, Multi-Query Optimization, Catalogs, Persistent Storage, Recursion
- Overlay Network: Query Dissemination, Replication, Soft-State, Quality of Service, Resilience
- Physical Network: Route Flapping, Efficiency

Applications & Requirements
- File sharing
  - Flooding works for popular items
  - Need something better for rare items
  - May want ‘triggers’ when a new item matches an old search
- Network Monitoring
  - Aggregation & grouping very common
  - Continuous queries with well-defined semantics
PHI is one use of PIER…
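
The file-sharing ‘trigger’ idea can be sketched in a few lines. This is only an illustration under stated assumptions, not PIER code: the `TriggerIndex` class, its `register` and `publish` methods, and the whitespace keyword matching are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TriggerIndex:
    """Hypothetical store of past searches, matched against later items."""

    # Maps each keyword to the ids of the searches that asked for it.
    searches: dict[str, set[str]] = field(default_factory=dict)

    def register(self, search_id: str, keyword: str) -> None:
        """Remember a search so it can fire when a matching item appears."""
        self.searches.setdefault(keyword.lower(), set()).add(search_id)

    def publish(self, filename: str) -> set[str]:
        """Return the ids of stored searches matched by a newly shared item."""
        matched: set[str] = set()
        # Assumption: an item matches a search if any whitespace-separated
        # word of its name equals the search keyword (case-insensitive).
        for word in filename.lower().split():
            matched |= self.searches.get(word, set())
        return matched
```

A search registered with `register("q1", "rare")` would be returned by a later `publish("Rare Live Recording")`, which is the point at which the original searcher could be notified.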

PHI
- Public Health for the Internet
- Community-based monitoring
- The metaphor:
  - Old way: Treat computers with medicine
    - Virus protection
  - New way: Monitor the community
    - Like the Center for Disease Control
    - Global CDC has social implications
      - Central repository, privacy, who controls it, who pays for it…
    - PHI wants to create the Center for Disease Control without the Center (of control)
- Motivation is to inform users about the dangers of the Internet

PHI Example
- PIER is currently deployed on 150-300 PlanetLab nodes.
  - ~100 sites
  - Some nodes on DSL, 1 Mbps, 10 Mbps, etc.
  - Very unreliable
- SNORT is the primary data source
  - ~2400 rules
  - 10s to 1000s of tuples per day per node
  - Schema: time, rule, source socket, destination socket
- Quick Demo:
  - Shows the top ten sources of events across all of PlanetLab (live), i.e. who are the bad guys?
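
The demo query can be sketched as a two-stage aggregation: each node counts its own events, and the partial counts are then merged. This is a minimal sketch, not PIER's operators or query language. The `Event` type follows the slide's schema, but the `"ip:port"` socket format, the choice to group by source address rather than full socket, and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """One intrusion event, following the schema on the slide."""

    time: float        # event timestamp (units assumed: seconds)
    rule: int          # id of the rule that fired
    source: str        # source socket, assumed to look like "ip:port"
    destination: str   # destination socket, assumed to look like "ip:port"


def source_ip(socket: str) -> str:
    """Drop the port from an "ip:port" string (assumed format)."""
    return socket.rsplit(":", 1)[0]


def local_counts(events: Iterable[Event]) -> Counter:
    """Partial aggregate computed on one node: events per source address."""
    return Counter(source_ip(e.source) for e in events)


def top_sources(partials: Iterable[Counter], k: int = 10) -> list[tuple[str, int]]:
    """Merge per-node partial counts and return the k most frequent sources."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts for sources seen on several nodes
    return total.most_common(k)
```

Counting locally first means only one small summary per node has to cross the network, which matters when some nodes sit behind DSL links.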

What’s next…
- PIER
  - Lots of problems, including the meta-problem of what problem to work on
  - No streaming semantics, no language to describe windows, etc…
    - Additional challenges: Interaction with soft-state, no synchronized clocks, unknown (changing) network latencies
- PHI
  - Create a complete application
    - Gets intrusion data from a variety of sources (including the built-in Windows Firewall)
    - Develop a snazzy visualization
  - Release to the world, first using PlanetLab as the query processor, eventually the world
  - Scale to at least 10,000s of nodes and explore the design space
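
To see why unsynchronized clocks complicate windows, consider the sketch below. It assumes fixed-width (tumbling) windows and a constant per-node clock offset. Both are illustrative assumptions rather than semantics proposed in the talk, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def window_index(timestamp: float, width: float) -> int:
    """Index of the fixed-width (tumbling) window containing the timestamp."""
    return int(timestamp // width)


def count_per_window(true_times: Iterable[float], clock_offset: float, width: float) -> Counter:
    """Count events per window as seen by a node whose clock is off by clock_offset."""
    # The node stamps each event with its own (possibly skewed) clock,
    # so the window an event lands in depends on the offset.
    return Counter(window_index(t + clock_offset, width) for t in true_times)
```

With 60-second windows, an event at true time 59.0 falls in window 0 on a correct clock but in window 1 on a node running 2 seconds fast, so two nodes can disagree on the per-window counts for the same events.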
- Slides: 8