Finding a needle in haystack Facebooks photo storage

Finding a needle in haystack: Facebook's photo storage By: Neha Allangh Seattle University

Contents • • Problem Description Typical Design description Need for new design (HAYSTACK) Haystack’s goal Haystack’s Design Optimization techniques Results Evaluation

Problem Description • Facebook stores over 260 billion images i. e about 20 PB of data Users upload 1 billion new photos each week i. e about 60 TB of data Facebook serves over 1 million images per second at peak Two types of workload for image serving 1. Profile pictures and pictures recently updated 2. Photo albums and older photos How to deal with this much amount of data? • • •

Photo's Popularity

Requirements • High throughput and Low latency • Fault tolerance • Cost effective • Simple design

Typical Design

1. Browser sends an HTTP request to the web browser 2. Web server is responsible for generating the markup for the browser to render 3. For each image there is a URL directing the browser to a location from which to download the data: for popular sites this URL often point to CDN(Content Delivery Network) a. If CDN has the data ->respond immediately b. Else it examines the URL->get from photostorage system->update cache data

Facebook's old NFS based design

NFS Design description • Each photo stored in it's own file on a set of commercial NAS appliances • Photo store servers mount all the volumes exported by NAS appliances over NFS • Photo store serve process HTTP request for images

a. Extracts the volume and full path to the file from URL of the image b. Read the data over NFS c. Return the result to CDN

Problems with NFS Design • Large number of disk operations needed in order to fetch less popular/old photos Upto 10 disk operations needed to fetch a single photo Upto several i/o operations needed to find the correct inode Too much reliance on CDN's which is an expensive approach Key problem is DISK OPERATION • • •

Solution 1. Reduce the directory sizes to 100 of images per directory 2. Reduction from 10 operations per image to 3 operations per image a. Read the directory metadata into memory b. Load the into memory c. Read the file contents 3. Photo storage server explicitly cache file handles returned by NAS appliances

Moving towards Haystack • The existing systems lack the right RAM to disk ratio i. e they do not have enough main memory to hold the file systems metadata • Each photo corresponds to one file and each file requires atleast one inode which is 100 bytes long and some inodes like xfs_inode_t is 536 bytes long So keeping all these heavy inodes in main memory? Feasibility?

Facebook decided to develop it's own storage system!!! HAYSTACK

Haystack Goals • • High throughput and low latency ->Photos should be served quickly to facilitate a good user experience Fault Tolerant ->Users should not experience errors despite inevitable server crashes and hard drive failures

• Cost Effective ->Haystack is less expensive as compared to old NFS design ->Cost per terabyte of usable storage ->Read rate normalized for each terabyte of storage Simple ->Simple design, easy to develop , less time •

Haystack Design Overview • • Use CDN to serve popular images Store multiple photos on single file Arrange them one after “another” Three core components: ->Haystack Directory ->Haystack Cache ->Haystack Store

Haystack’s Design

Haystack’s Directory • • • Maintains mappings from logical to physical volumes ->Mappings used for uploading photos http: //<CDN>/<Cache>/<Machine ID>/<Logical volume, Photo> Balances writes across logical volumes and reads across physical volumes Determines whether the photo request should be handled by the CDN or cache

• It identifies those logical volumes that are read only ->machine becomes read only logical volumes reach there storage capacity ->because of operational reasons

Haystack’s Cache • It functions as an internal CDN ->Receives HTTP requests for photos from CDN and also directly from users browser Cache’s a photo when: ->Request comes directly from the user and not the CDN ->Photo is fetched from a write enabled store machine a. Use of cache to shelter write enabled store machine from reads as photos are most •

heavily accessed soon after they are uploaded b. Haystack performs better when doing either reads or writes

Haystack’s Store • • Each store machine manages multiple physical volumes(holds millions of photos) Each physical volume is assigned to a logical one(redundancy for fault tolerance) Each physical volume is a large file(100 GB) that contains many photos Basic operations: ->Read ->Write ->Delete

Physical Volume Layout • Store machine represents a physical volume as a large file consisting of a superblock followed by a sequence of needles Think of a physical volume as a very large file (100 GB) saved as ‘/haystack <logical volume id>’

• Each needle represents a photo stored in Haystack Uniquely identified by <Offset, Key, Alternate Key, Cookie>

Layout of Haystack Store file

Photo Read • Cache machine requests a photo it supplies the logical volume id, key, alternate key, and cookie Cookie’s value is randomly assigned by and stored in the Directory at the time that the photo is uploaded Used to eliminates attacks aimed at guessing valid URLs for photos

• Store machine looks up the relevant metadata in its in memory mappings. Checks if it is not deleted Seeks to the appropriate offset in the volume file Reads the entire needle from disk Verifies the cookie and the integrity of the data

Photo Write • • • When Uploading a photo into haystack webservers provide: Logical volume id, key, alternate key, cookie and data to store machine Each machine synchronously appends needle images to its physical volume files and update in-memory mappings as needed Volumes are append-only so photos can only be modified by adding an updated needle with the same key and alternate key

If the new needle is written to different logical volume: the Directory updates its application metadata and future requests will never fetch the older version. If same logical volume then append the needle to same physical volume. Duplicated distinguished based on their offsets: highest offset =latest version

Uploading/Writing a photo jj

Photo Delete • • Is a straightforward technique The store machine sets the delete flag in both the in-memory mapping and synchronously in the volume file The space occupied by deleted files is lost momentarily and reclaimed later via compaction

Optimization techniques by haystack store • • Compaction Online operation that reclaims the space used by deleted and duplicate needles Needles are copied into a new file and the new file is replaces the current file Delete pattern Similar to photo views Batch Upload Batch upload of multiple photos

• Index File Store machines maintain an index file for each of their volumes Index files allow a store machine to build its in-memory mappings quickly, shortening restart time Index file is a checkpoint of the in-memory data structures used to locate needle efficiently on disk

Layout of Haystack Index file

Results • • • Our main aim was to store metadata in memory so haystack achieved that goal It’s less expensive as compared to previous designs Haystack overhead On an average each photo needs 10 bytes of memory Each photo is scaled to four photos: same key(64 bits),

Different alternate keys(32 bits) Different data sizes(assume 16 bits) 2 bytes per image in overhead due to hash table(haystack cache) Total for four scaled photos of same image as 40 bytes which is compared to 536 byte xfs_inode_t in Linux

Volume of photo daily traffic

Evaluation for read only machines

Evaluation for write only machines

While comparing the graphs we found that there are 4 times more read per second (on an average) with using haystack as compared to the standard approach.

THANK YOU!