Characterizing Files in the Modern Gnutella Network: A Measurement Study
Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
University of Oregon
SPIE Multimedia Computing and Networking 2006 (MMCN '06), 18-19 January 2006, San Jose, California, USA

Outline
  • Measurement study of the modern Gnutella system
  • Conduct static, topological, and dynamic analysis
  • Help improve the design and evaluation of P2P file-sharing applications

Previous studies
  • Focus on a small population
  • Are more than three years old
  • Do not examine the dynamics of file characteristics over time, or the correlation between the overlay topology and file distribution

Why Gnutella
  • One of the top three P2P systems (eDonkey2K, FastTrack, Gnutella)
  • Gnutella has a Browse-Host extension for extracting the list of shared files from peers
  • One of the most studied P2P systems, so results can be compared and contrasted with previous studies

Original Gnutella
  • A new node (Node A) joins the system
  • Node A connects to some node (Node B) found through a preexisting list, a particular website, IRC, etc.
  • Node B sends its list of working nodes to Node A
  • Node A connects to the provided nodes until a certain threshold is reached
  • During a search, Node A sends requests to its connected nodes, which in turn forward the requests

Original Gnutella
  • Nodes reply to the request directly or indirectly, depending on whether a firewall is present
  • Node A downloads file pieces from one or more nodes that responded positively
  • Unlike Napster, Gnutella is decentralized and uses flood-based searches
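
The flood-based search described for the original Gnutella can be illustrated with a small simulation. This is a minimal sketch, not Gnutella's wire protocol: the overlay is a hypothetical in-memory adjacency map, `flood_search` and its `ttl` parameter are illustrative names, and the hop limit stands in for a time-to-live that bounds how far a query is forwarded.

```python
from collections import deque

def flood_search(overlay, shared, start, keyword, ttl=3):
    """Flood a query from `start`; return the set of peers holding `keyword`.

    overlay: dict mapping peer -> list of neighbor peers (hypothetical topology)
    shared:  dict mapping peer -> set of file names that peer shares
    ttl:     maximum number of hops the query may travel (assumed value)
    """
    hits = set()
    seen = {start}                      # peers that already handled the query
    queue = deque([(start, ttl)])
    while queue:
        peer, hops_left = queue.popleft()
        if peer != start and keyword in shared.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue                    # TTL exhausted: do not forward further
        for neighbor in overlay.get(peer, []):
            if neighbor not in seen:    # drop duplicate copies of the query
                seen.add(neighbor)
                queue.append((neighbor, hops_left - 1))
    return hits

# Made-up five-peer overlay for illustration only.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
shared = {"B": {"song.mp3"}, "E": {"song.mp3"}, "D": {"clip.avi"}}

print(flood_search(overlay, shared, "A", "song.mp3", ttl=2))
print(flood_search(overlay, shared, "A", "song.mp3", ttl=3))
```

With the smaller hop limit the query never reaches peer E, which shows why flooding trades search coverage against message overhead.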

Modern Gnutella
  • In contrast to the original unstructured overlay topology, most modern Gnutella clients adopt a two-tier overlay structure
  • Ultrapeers and leaf peers (the majority)
  • Legacy peers (do not implement the ultrapeer feature)

Measurement methodology
  • Problems of general crawlers
      • Slow, distorted, inflate population
  • Previous studies
      • Partial snapshot, periodic probe of a fixed group
      • Significance is doubted
  • Goal of this work
      • Capture entire population (?)
      • Short period

Measurement methodology
  • Topology crawl
      • List of neighboring nodes
  • Content crawl
      • List of available files of each node
      • Need more

Cruiser
  • Parallel P2P crawler
  • Orders of magnitude faster than previous crawlers (?)
  • Master-slave architecture
      • Each slave crawls hundreds of peers, and the master coordinates multiple slaves
      • Increases the degree of concurrency
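
The coordination pattern on this slide can be sketched with a thread pool. This is an illustration of the general idea only, not Cruiser's implementation: `NETWORK` is a made-up overlay that replaces real network requests, and `fetch_neighbors` and `crawl` are hypothetical names. The master keeps the list of peers still to be visited and hands each wave to the workers, which contact peers concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical overlay used instead of real network requests.
NETWORK = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1"],
    "p4": ["p2", "p5"],
    "p5": ["p4"],
}

def fetch_neighbors(peer):
    """Stand-in for contacting one peer and asking for its neighbor list."""
    return peer, NETWORK.get(peer, [])

def crawl(seed, workers=4):
    """Master loop: hand each wave of newly discovered peers to the workers."""
    topology = {}
    frontier = [seed]
    discovered = {seed}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Workers contact all peers in the current wave concurrently.
            results = list(pool.map(fetch_neighbors, frontier))
            frontier = []
            for peer, neighbors in results:
                topology[peer] = neighbors
                for n in neighbors:
                    if n not in discovered:
                        discovered.add(n)
                        frontier.append(n)
    return topology

snapshot = crawl("p1")
print(sorted(snapshot))
```

Raising `workers` increases the number of peers contacted at the same time, which is the "degree of concurrency" the slide refers to.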

Cruiser
  • Using 6 off-the-shelf 1 GHz GNU/Linux boxes, a crawl takes 15 min + 5.5 hr + 15 min ≈ 6 hours
  • Each content crawl produces a 10 GB log file containing file names and content hashes

Dataset
  • Three measurement periods; within each period, snapshots are taken every day
  • 6/8/2005 to 6/18/2005, 8/23/2005 to 9/9/2005, and 10/11/2005 to 10/21/2005
  • Examine both short and long timescales

Dataset (continued)

Sources of unreachable nodes
  • Firewall
  • Severe network congestion
  • Peer departed
  • Peer does not support the Browse Host protocol
  • Ultrapeers: departure
  • Leaf peers: departure and firewall
  • Contact 20% of peers (about half a million)

Problems
  • Low-bandwidth TCP connection
      • Some crawls do not complete within the timeout threshold because the data is sent at an extremely low rate
  • File identity
      • A file name is not a reliable file identifier, so this work uses the content hash
  • Post-processing
      • More than 100 million distinct files
      • Divide randomly into 7 segments, trim files with fewer than 10 copies in a segment, then combine the trimmed segments back into one
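
The file-identity and post-processing steps can be sketched as follows. This is one possible reading, not the authors' code: the slide does not say which hash is used or what exactly is split into segments, so SHA-1 and the random assignment of each (peer, hash) record to a segment are assumptions, and `content_id` and `trim_rare_files` are hypothetical names.

```python
import hashlib
import random
from collections import Counter

def content_id(data: bytes) -> str:
    """Identify a file by a hash of its bytes rather than by its name (SHA-1 assumed)."""
    return hashlib.sha1(data).hexdigest()

def trim_rare_files(records, segments=7, min_copies=10, seed=0):
    """Split records into random segments, drop rare hashes, merge the counts.

    records: iterable of (peer_id, content_hash) pairs
    Returns a Counter of content_hash -> number of copies that survived trimming.
    """
    rng = random.Random(seed)
    buckets = [Counter() for _ in range(segments)]
    for _peer, file_hash in records:
        buckets[rng.randrange(segments)][file_hash] += 1
    merged = Counter()
    for bucket in buckets:
        for file_hash, copies in bucket.items():
            if copies >= min_copies:    # keep only hashes seen often enough here
                merged[file_hash] += copies
    return merged

# The same bytes map to one identity regardless of the file name they carry.
print(content_id(b"same bytes") == content_id(b"same bytes"))

# Made-up records: one widely replicated file and one with only 5 copies.
records = [(f"peer{i}", "popular") for i in range(700)]
records += [(f"peer{i}", "rare") for i in range(5)]
kept = trim_rare_files(records)
print("rare" in kept, kept["popular"] > 0)
```

Trimming per segment keeps each pass small enough to process, at the cost of discarding files that are rare within every segment.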

Static analysis
  • Ratio of free riders
  • Degree of resource sharing among cooperative peers
  • File popularity distribution
  • File type analysis

Ratio of free riders
  • The fraction of free riders has dropped; the ratio for ultrapeers is lower and for long-lived peers slightly higher; the number of files is not strongly correlated
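
A free-rider ratio can be computed from one snapshot as shown below. This sketch assumes the common definition of a free rider as a peer that shares no files; the slide does not state its definition, and the snapshot values are made up.

```python
def free_rider_ratio(files_per_peer):
    """Fraction of peers that share zero files."""
    if not files_per_peer:
        return 0.0
    free_riders = sum(1 for count in files_per_peer.values() if count == 0)
    return free_riders / len(files_per_peer)

# Hypothetical snapshot: peer id -> number of shared files.
snapshot = {"u1": 120, "u2": 0, "l1": 35, "l2": 0, "l3": 7, "l4": 0, "l5": 1, "l6": 4}
print(free_rider_ratio(snapshot))
```

Applying the same function to subsets of peers (for example only ultrapeers, or only long-lived peers) gives the per-group ratios that the slide compares.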

Degree of resource sharing among cooperative peers
  • Distribution of the number of peers sharing x files: power-law distribution

Degree of resource sharing among cooperative peers
  • Distribution of contributed disk space: power-law distribution
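
One simple way to check whether such a distribution looks like a power law y = c * x^(-alpha) is to fit a straight line to the points on log-log axes. The sketch below uses ordinary least squares for that; it is a quick heuristic rather than the method used in the study, and the data points are synthetic, generated with an arbitrary exponent of 1.5.

```python
import math

def fit_power_law(points):
    """Least-squares line through (log x, log y); returns (alpha, c) for y = c * x**(-alpha)."""
    logs = [(math.log(x), math.log(y)) for x, y in points if x > 0 and y > 0]
    n = len(logs)
    mean_x = sum(lx for lx, _ in logs) / n
    mean_y = sum(ly for _, ly in logs) / n
    sxx = sum((lx - mean_x) ** 2 for lx, _ in logs)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic data: number of peers sharing x files, generated from 1000 * x**-1.5.
points = [(x, 1000 * x ** -1.5) for x in (1, 2, 4, 8, 16, 32, 64)]
alpha, c = fit_power_law(points)
print(round(alpha, 3), round(c, 1))
```

On real measurements the points scatter around the line, so the fitted exponent should be read together with a plot of the data.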

Degree of resource sharing among cooperative peers
  • Correlation between the number of shared files and contributed space is not as strong as in previous studies
  • Discernible line with a slope of 3.7 MB/file, which is the typical size of an MP3 audio file
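
A slope in MB per file, like the one on this slide, can be estimated by fitting a line through the origin to (number of files, shared megabytes) pairs. The sketch below shows the arithmetic on made-up peers whose files all average 3.7 MB; it is not the procedure used in the study, and `slope_through_origin` is a hypothetical helper.

```python
def slope_through_origin(pairs):
    """Least-squares slope b for y = b * x, with no intercept term."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx

# Hypothetical peers: (number of shared files, shared megabytes).
peers = [(10, 37.0), (100, 370.0), (250, 925.0), (40, 148.0)]
print(round(slope_through_origin(peers), 2))
```

Peers that mostly share MP3 files would fall near this line, while peers sharing large video files would sit well above it.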

File popularity distribution

File type analysis

File type analysis

          Previous studies            Current study
  Music   67.2% files, 79.2% bytes    67% files, 40% bytes
  Video   2.1% files, 19.1% bytes     6% files, 52.5% bytes
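
Shares like the ones in this table come from grouping files by type and dividing by the totals, once by file count and once by bytes. The sketch below shows that calculation; the extension-to-type mapping is an assumption (the slides do not list which extensions were counted), and the file list is made up.

```python
from collections import defaultdict

# Assumed extension groups; the source does not list which extensions it used.
TYPE_BY_EXT = {"mp3": "music", "wma": "music", "avi": "video", "mpg": "video"}

def type_shares(files):
    """Return {type: (share of files, share of bytes)} for (name, size) records."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for name, size in files:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        kind = TYPE_BY_EXT.get(ext, "other")
        counts[kind] += 1
        sizes[kind] += size
    total_files = sum(counts.values())
    total_bytes = sum(sizes.values())
    return {k: (counts[k] / total_files, sizes[k] / total_bytes) for k in counts}

# Hypothetical files with sizes in MB.
files = [("a.mp3", 4), ("b.mp3", 4), ("c.MP3", 2), ("d.avi", 30), ("e.txt", 0)]
shares = type_shares(files)
print(shares["music"], shares["video"])
```

The toy data reproduces the pattern in the table: music dominates by file count while a single video file dominates by bytes.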

Topological analysis
  • Per-file perspective: figures (a) and (b)
  • Per-peer perspective: figure (c)

Topological analysis
  • Churn (dynamics of peer participation) is the dominant factor
      • Peers depart
      • Peers join
      • Leaf peers become ultrapeers
  • Rapid change in the overlay topology prevents the formation of topological clustering

Dynamics analysis
  • Variations in shared files by individual peers
  • Variations in popularity of individual files
  • Trends in popularity variations

Variations in shared files by individual peers

Variations in popularity of individual files
  • Focus on the top 100 and top 1000 files
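
One way to quantify how much the set of most popular files changes between two snapshots is to measure how many of the first day's top-N files remain in the top N on a later day. The slide names only the focus on the top 100 and top 1000 files, so this retention metric, the function names, and the copy counts below are assumptions for illustration.

```python
from collections import Counter

def top_n(copies, n):
    """Set of the n file hashes with the most copies in one snapshot."""
    return {h for h, _ in Counter(copies).most_common(n)}

def top_n_retention(day_a, day_b, n):
    """Fraction of day_a's top-n files that are still in day_b's top-n."""
    first = top_n(day_a, n)
    return len(first & top_n(day_b, n)) / len(first)

# Hypothetical snapshots: file hash -> number of copies seen that day.
day1 = {"f1": 900, "f2": 800, "f3": 700, "f4": 100}
day2 = {"f1": 950, "f2": 300, "f3": 720, "f4": 650}
print(top_n_retention(day1, day2, 3))
```

Computing the same value for N = 100 and N = 1000 across consecutive days would show whether the most popular files churn more or less than the broader popular set.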

Trends in popularity variations
  • Track the top 10 files across several days (figures a and b)
  • Over several months (figure c)

Conclusion
  • Use parallel crawls to obtain snapshots of peer connectivity and available files
  • Conduct three types of analysis
  • Understand the distribution, correlation, and dynamics of available files

Summary of findings
  • Free riding has dropped significantly
  • The number of shared files and the storage space contributed by individual peers follow a power-law distribution
      • Most peers contribute little disk space (<100 MB), while a small number of peers contribute very large space (50-100 GB)
  • The popularity of individual files follows a Zipf distribution
      • A small number of files are extremely popular, but the majority of files are very unpopular

Summary of findings
  • The most popular file type is MP3 (2/3 of all files, 1/3 of all bytes)
  • The popularity of video files and the space they occupy have tripled over the past few years
  • The number of video files is less than 1/10 of the number of audio files, but video files occupy 25% more bytes
  • Multimedia files account for 93% of bytes and 73% of files

Summary of findings
  • Files are randomly distributed: there is no strong correlation between the files available at peers that are one, two, or three hops apart in the overlay topology
  • The files shared by individual peers change slowly over a timescale of days; more popular files experience larger variations in popularity
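
The first finding compares the files held by peers at a given hop distance. A minimal sketch of such a comparison follows, using Jaccard overlap between file sets as the similarity measure; that choice is an assumption, since the slides do not name the measure used, and the overlay and file sets are made up.

```python
from collections import deque

def peers_at_distance(overlay, start, hops):
    """Peers whose shortest-path distance from `start` is exactly `hops`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        if dist[peer] == hops:
            continue                    # no need to explore beyond the target distance
        for neighbor in overlay.get(peer, []):
            if neighbor not in dist:
                dist[neighbor] = dist[peer] + 1
                queue.append(neighbor)
    return {p for p, d in dist.items() if d == hops}

def jaccard(a, b):
    """Overlap between two file sets: shared items over all items, 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_overlap(overlay, files, hops):
    """Average Jaccard overlap over all ordered peer pairs exactly `hops` apart."""
    scores = [
        jaccard(files.get(p, set()), files.get(q, set()))
        for p in overlay
        for q in peers_at_distance(overlay, p, hops)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical four-peer chain A - B - C - D and the files each peer shares.
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
files = {"A": {"x", "y"}, "B": {"x"}, "C": {"y", "z"}, "D": {"z"}}
for hops in (1, 2, 3):
    print(hops, round(mean_overlap(overlay, files, hops), 3))
```

A finding of "no strong correlation" would correspond to these averages being close to the overlap measured between randomly chosen pairs of peers.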