Data Science at Facebook
Itamar Rosenn, Eric Sun
5/4/09

Facebook Data
▪ Social Graph
  ▪ 200M+ active users
  ▪ 100M+ users come to the site each day
  ▪ several hundred thousand new users join each day
  ▪ hundreds of dimensions per user (numerical, categorical, text)
  ▪ average user has over 120 friends
  ▪ friendships on Facebook span many different types of relationships
▪ Social Behavior
  ▪ Actions: users interact with hundreds of thousands of applications, on and off the site
  ▪ Interactions: users interact directly with each other via over 100 distinct types of events
▪ Social Content
  ▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc.

Managing Data at Scale
▪ Solution: Hadoop + Hive
  ▪ HDFS / Hadoop (MapReduce in Java)
  ▪ MetaStore (metadata management)
  ▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore)
▪ Data Scale
  ▪ More than 1 PB raw capacity in largest HDFS / Hadoop cluster
  ▪ Over 2 TB uncompressed data collected each day
  ▪ Dozens of TB worth of data read / written each day via Hadoop + Hive

Data Science - What We Do Behavioral Analysis Data-Driven Systems Product Health Metrics Launch Evaluations Growth Modeling User Churn Modeling Production Incentives Content Diffusion Ad CTR Prediction PYMK Search Ranking Highlights Hive Hadoop Data Infrastructure

Data Science – Who We Are
Dennis Decoste, Thomas Lento, Lee Byron, Danny Ferrante, Roddy Lindsay, Cameron Marlow, Ravi Grover, Itamar Rosenn, Alex Smith, Venky Iyer, James Mayfield

Maintained Relationships on Facebook
▪ Question: is Facebook increasing the size of people’s personal networks?
▪ Task:
  ▪ the types of relationships people maintain on the site
  ▪ the relative size of these groups

Types of Relationships
▪ People you know
  ▪ Facebook friends = people you’ve met at some point in life
  ▪ Researchers have estimated this number to be somewhere between 300 and 3,000 (Gladwell, Killworth)
▪ Communication network
  ▪ Individuals with whom you communicate on a regular basis
  ▪ Includes your core support network, which may be as low as 3 people
  ▪ Kossinets and Watts observed communication network size of 10-20
▪ Maintained relationships
  ▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing
  ▪ This information consumption is a form of relationship management, as it can lead to direct

Measuring Network Size on Facebook
Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:
▪ All friends
  ▪ The largest representation of a person’s network is the set of people they have verified as friends.
▪ Reciprocal communication
  ▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.
▪ One-way communication
  ▪ The number of friends to whom the user has reached out via messages, wall posts, or comments.
▪ Maintained relationships
  ▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice.
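The four definitions above can be sketched as set operations over a user's activity log. This is a minimal illustration, not the pipeline used for the study: the function name, the tuple-based event format, and the reading that "at least twice" applies only to profile visits are all assumptions.

```python
from collections import defaultdict


def network_sizes(user, friends, messages, feed_clicks, profile_visits):
    """Return the four network sizes for one user over the observation window.

    Hypothetical input format (not from the slides):
      friends        -- set of friend ids
      messages       -- (sender, recipient) pairs: messages, wall posts, comments
      feed_clicks    -- (viewer, author) pairs: Newsfeed stories clicked
      profile_visits -- (viewer, owner) pairs: one entry per visit
    """
    # Friends the user reached out to, and friends who reached out to the user.
    sent = {r for s, r in messages if s == user and r in friends}
    received = {s for s, r in messages if r == user and s in friends}

    # Count profile visits per friend; assumption: the "at least twice"
    # threshold applies to profile visits, not to story clicks.
    visits = defaultdict(int)
    for viewer, owner in profile_visits:
        if viewer == user and owner in friends:
            visits[owner] += 1
    visited_twice = {owner for owner, n in visits.items() if n >= 2}
    clicked = {a for v, a in feed_clicks if v == user and a in friends}

    return {
        "all_friends": len(friends),
        "one_way": len(sent),
        "reciprocal": len(sent & received),
        "maintained": len(clicked | visited_twice),
    }
```

Because reciprocal communication is the intersection of sent and received contacts, it can never exceed the one-way count, which matches the ordering of the definitions on the slide.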

Findings
▪ As a function of the # of friends a user has, she is passively engaging with 2 to 2.5 times more people than those with whom she directly communicates

Systemic Effects
▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.

Content Production among New Users
▪ Mission: Give people the power to share and make the world more open and connected.
▪ Question: What mechanisms lead Facebook newcomers to share content on the site?

Content Production
In new users’ first two weeks:
▪ 45% upload a photo
▪ 41% use a 3rd-party app
▪ 30% send a private message
▪ 27% compose a status update
▪ 22% write on a friend’s wall

Production Incentives Hypotheses
▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.
▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content.
▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.
▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.

Method
▪ Quantitative
  ▪ Selected two cohorts:
    ▪ Nov. 5, 2007 (N=347,403)
    ▪ Mar. 3, 2008 (N=254,603)
  ▪ Observed activity in their first two weeks
  ▪ Predicted how many photos they would upload between their third and fifteenth week on Facebook
▪ Qualitative
  ▪ 40-minute semi-structured interviews with seven new users
  ▪ Recorded audio/video and screen
  ▪ Asked about typical uses of Facebook, content production, social norms, privacy

Features
▪ Independent Variables
  ▪ H1. Feedback
    ▪ Comments received
  ▪ H2. Distribution
    ▪ # of times content was viewed in Newsfeed
    ▪ # of friends who viewed content in Newsfeed
  ▪ H3. Social Learning
    ▪ Number of friends’ photos seen
  ▪ H4. Singling Out
    ▪ Number of times tagged
▪ Controls
  ▪ Age
  ▪ Gender
  ▪ Number of friends
  ▪ Total pages viewed
  ▪ Initial engagement with photos:
    ▪ # of photos uploaded
    ▪ # of photos viewed
    ▪ Photo tags created
    ▪ Photo comments written

Results
Each cell gives the coefficient, the % change from the intercept, and the significance level.

                                                Model 1 – Early Uploaders   Model 2 – Everyone
Intercept                                       1.2                         1.9

Controls
Age (in years)                                  -0.01   -1.0%    ***        -0.01   -0.7%    ***
Male (0/1)                                       0.48   +39.3%   ***         0.84   +79.6%   ***
Female (0/1)                                     1.21   +131.2%  ***         1.43   +169.8%  ***
Pages viewed +                                   0.24   +18.4%   ***        -0.02   -1.6%    ***
Photo pages viewed +                             2.80   +597.4%  ***         2.35   +408.3%  ***
Photo comments made                              0.15   +11.2%   ***         0.24   +17.7%   ***
Photo tags created                               0.10   +6.9%    ***         0.17   +12.6%   ***
Photos uploaded                                  0.30   +22.8%   ***
Early-uploader (0/1)                                                         0.39   +30.6%   ***

Independent Vars
Comments received (0/1)                          0.09   +6.2%    ***
Photo views received                             0.04   +2.6%    ***
Photo stories seen                               0.09   +6.1%    ***
Photo stories seen X early-uploader                                          0.03   +2.2%    ***
Photo stories seen X non-early-uploader                                      0.15   +10.7%   ***
Photo tags received (0/1)                        0.03   +2.1%    (ns)
Photo tags received X early-uploader (0/1)                                  -0.05   -3.6%    (ns)
Photo tags received X non-early-uploader (0/1)                               0.10   +7.2%    ***
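The slides do not say how the "% change from int." column is derived from the coefficients. As a rough check, the reported percentages appear consistent with a base-2 log-transformed outcome, i.e. % change ≈ (2^coefficient − 1) × 100. That base is an inference from the numbers, not something the deck states, and the small gaps below presumably come from coefficients being rounded to two decimals. The function name is hypothetical.

```python
def pct_change(coef, base=2.0):
    """Percent change implied by a coefficient on a log-scale outcome.

    Assumption (not stated in the slides): the outcome is log-transformed
    with the given base, so a coefficient b multiplies the outcome by base**b.
    """
    return (base ** coef - 1.0) * 100.0


# (coefficient, reported % change) pairs copied from the Results slide.
reported = {
    "Male, Model 1": (0.48, 39.3),
    "Female, Model 1": (1.21, 131.2),
    "Photo pages viewed, Model 1": (2.80, 597.4),
    "Photo comments made, Model 2": (0.24, 17.7),
}

for name, (coef, pct) in reported.items():
    print(f"{name}: coef={coef:+.2f} reported={pct:+.1f}% implied={pct_change(coef):+.1f}%")
```

If the natural log had been used instead, a coefficient of 0.48 would imply roughly a 62% increase rather than the reported 39.3%, which is why base 2 is the working assumption here.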

Summary of Results
Hypothesis / Early-uploaders / Non-early-uploaders
▪ H1. Feedback: Support / N/A
▪ H2. Distribution: Modest Support / N/A
▪ H3. Social learning: Support
▪ H4. Singling out: No Support

▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.
▪ For new users already uploading photos, feedback is associated with increased content production, and distribution is marginally important.

Modeling Contagion Through Newsfeed
▪ How do ideas spread through a social network?
▪ Use Facebook Pages to model diffusion patterns
▪ Compare results with existing models of diffusion
▪ Show Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s interconnectedness and diffusion properties.
▪ Note: Research based on “old” Facebook (pre-March 2009)
▪ Still relevant: first empirical analysis of large-scale collisions of short chains

Theory of the Influentials
▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)
▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free
▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)

Contagion Theory
▪ Duncan Watts: Anyone can be an influencer.
▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not
▪ Success depends not on how persuasive the early adopter(s) are, but on whether everyone else is easily persuaded.

How Do Ideas Spread on Facebook?
▪ News Feed allows for efficient diffusion of ideas
▪ Facebook’s Pages product is one of the most viral features of the site.
▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents
Example (Chain of Length 1):
▪ Alice fans a Page
▪ Bob sees Alice’s action on his News Feed; Bob fans the Page as well
▪ Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well

Large-Scale Result: Large Connected Trees of Diffusion
▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.

Large Connected Clusters
▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.
▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.
▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.
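One way to compute the share of a Page's fans that fall into the largest connected cluster is a union-find pass over the diffusion links. This is a minimal sketch under assumed inputs (a set of fan ids and a list of undirected links between fans); the slides do not describe how the clusters were actually computed.

```python
def largest_cluster_share(fans, edges):
    """Fraction of fans that belong to the largest connected cluster.

    fans  -- set of fan ids (hypothetical format)
    edges -- (fan_a, fan_b) pairs linking a fan to the friend whose
             action they saw; direction is ignored for connectivity
    """
    if not fans:
        return 0.0
    parent = {f: f for f in fans}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        # Skip links that touch someone who is not a fan of this Page.
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    sizes = {}
    for f in fans:
        root = find(f)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(fans)


fans = {"alice", "bob", "charlie", "dave", "erin"}
edges = [("alice", "bob"), ("alice", "charlie"), ("dave", "zed")]
share = largest_cluster_share(fans, edges)  # 3 of the 5 fans are connected
```

The Nastia Liukin example on the slide is the same ratio at scale: 71,090 / 96,922 fans in the largest cluster.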

How Do These Large Clusters Come About?
• Are these large clusters started by “one guy”?
  ▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”
  ▪ The variability in this percentage becomes very small as # fans increases
  ▪ The average node in the biggest cluster is connected to 2.899 others.
• Large clusters are formed when many long chains of diffusion merge together.

Diffusion Chains on Facebook vs. Real Life
• The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.
  ▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person
  ▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987)
  ▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals

How are Long Diffusion Chains Created?
• Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook
  ▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.
  ▪ If size can be predicted, we can then identify the most influential users.

Data
▪ Data consists of all the associations (actor → follower) for a representative selection of Pages.
▪ Pages were at least 40 days old and had at least 5,000 fans

Prediction Model
Response: max_chain_length
Predictors:
▪ gender
▪ log age
▪ log Facebook_age
▪ log feed_exposure (# friends who saw News Feed story)
▪ log friend_count
▪ log activity_count (wall posts + messages sent + photos added)
▪ log popularity (controls for News Feed exposure via Coefficient)
Method: zero-inflated negative binomial regression
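The response variable can be derived from the (actor, follower) associations as the longest downstream chain each user starts. The sketch below is one plausible reading, not the authors' code: it counts chain length in edges (so Alice → Bob is a chain of length 1), assumes the associations are acyclic because a follower fans a Page only after the actor, and uses a hypothetical function name. The regression step itself is not shown.

```python
from collections import defaultdict


def max_chain_lengths(associations):
    """Longest downstream chain (in edges) started by each user.

    associations -- iterable of (actor, follower) pairs; a follower may
                    have several actors (multiple parents).
    """
    children = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for actor, follower in associations:
        if follower not in children[actor]:
            children[actor].add(follower)
            indegree[follower] += 1
        nodes.update((actor, follower))

    # Kahn's algorithm: topological order with parents before children.
    remaining = {n: indegree[n] for n in nodes}
    stack = [n for n in nodes if remaining[n] == 0]
    order = []
    while stack:
        n = stack.pop()
        order.append(n)
        for c in children[n]:
            remaining[c] -= 1
            if remaining[c] == 0:
                stack.append(c)
    if len(order) != len(nodes):
        raise ValueError("associations contain a cycle")

    # Visit children before parents so each node reuses its children's results.
    length = {}
    for n in reversed(order):
        length[n] = max((1 + length[c] for c in children[n]), default=0)
    return length


chains = max_chain_lengths([
    ("alice", "bob"), ("alice", "charlie"),
    ("bob", "dave"), ("charlie", "dave"),
])
```

The iterative topological pass avoids deep recursion, which matters when chains are long; users with no incoming association are the "start points" of a cluster.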

Results
• Only consistent coefficient is on feed_exposure (# friends who saw News Feed story).
  ▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain
• Implies that friend_count is not realistically meaningful.
  ▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
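The "1% more people, 1% longer chain" reading follows from the predictor being log-transformed. Assuming the count component of the model uses a log link (standard for negative binomial regression, though not stated on the slide), the expected chain length scales as feed_exposure raised to the coefficient, so the coefficient acts as an elasticity. A small worked sketch, with a hypothetical function name:

```python
def expected_pct_change(pct_increase, beta):
    """Percent change in the expected count when a log-transformed
    predictor grows by pct_increase percent.

    Assumes E[y] is proportional to x**beta (log link, log predictor);
    the zero-inflation component of the model is ignored here.
    """
    return ((1.0 + pct_increase / 100.0) ** beta - 1.0) * 100.0


# A coefficient of 1 turns a 1% rise in feed_exposure into a 1% longer chain.
unit_elasticity = expected_pct_change(1.0, beta=1.0)

# A coefficient of 0 would mean exposure has no effect at all.
no_effect = expected_pct_change(1.0, beta=0.0)
```

A coefficient near 1 therefore means chain length grows roughly in proportion to exposure.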

Conclusions
• Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.
• The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.
• Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: chain length is related only to exposure.

Contact
www.facebook.com/data
itamar@facebook.com
esun@facebook.com