BlueDBM: An Appliance for Big Data Analytics
BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory, +Quanta Research Cambridge
ISCA 2015, Portland, OR. June 15, 2015
This work is funded by Quanta, Samsung and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.
Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models
Likely to be the biggest economic driver for the IT industry for the next decade
A currently popular solution: RAMCloud
Cluster of machines with large DRAM capacity and fast interconnect
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM
What if enough DRAM isn't affordable? Flash-based solutions may be a better alternative
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Legacy storage access interface is burdensome
- Slower than DRAM
Related work
Use of flash:
- SSDs, FusionIO, Pure Storage
- ZetaScale
- SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks:
- QuickSAN [ISCA 2013]
- Hadoop/Spark on InfiniBand RDMA [SC 2012]
Accelerators:
- SmartSSD [SIGMOD 2013], Ibex [VLDB 2014]
- Catapult [ISCA 2014]
- GPUs
Latency profile of distributed flash-based analytics
Distributed processing involves many system components:
- Flash device access: ~75 μs
- Storage software (OS, FTL, ...): 100~1000 μs
- Network interface (10GbE, InfiniBand, ...): 20~1000 μs
- Actual processing: 50~100 μs
Latency is additive
Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead:
- Near-storage processing (accelerator)
- Cross-layer optimization of flash management software
- Dedicated storage area network
Revised latency profile (see the sketch below):
- Flash device access: ~75 μs
- Storage software: < 20 μs (down from 100~1000 μs)
- Network: ~20 μs (down from 20~1000 μs)
- Processing: 50~100 μs
Difficult to explore using flash packaged as off-the-shelf SSDs
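Since the stage latencies simply add, the gap between the two stacks can be estimated with back-of-the-envelope arithmetic. The sketch below sums the per-stage figures from the two slides above; the exact breakdown of any real request is workload-dependent, so treat the numbers as illustrative bounds rather than measurements.

```python
# A minimal sketch of the additive latency budgets from the two slides
# above (values in microseconds, taken from the slides; real requests
# vary with workload, so these are illustrative bounds only).

CONVENTIONAL = {
    "flash access":     (75, 75),
    "storage software": (100, 1000),   # OS, FTL, ...
    "network":          (20, 1000),    # 10GbE, InfiniBand, ...
    "processing":       (50, 100),
}

MODIFIED = {
    "flash access":     (75, 75),
    "storage software": (0, 20),       # cross-layer optimized, < 20 us
    "network":          (20, 20),      # dedicated controller network
    "processing":       (50, 100),     # near-storage accelerator
}

def total(budget):
    """Sum the (min, max) bounds of every stage; latency is additive."""
    return (sum(lo for lo, _ in budget.values()),
            sum(hi for _, hi in budget.values()))

print("conventional stack: %d-%d us end-to-end" % total(CONVENTIONAL))
print("modified stack:     %d-%d us end-to-end" % total(MODIFIED))
```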
A custom flash card had to be built
[Board diagram: the card connects to the VC707 through an FMC HPC port and carries an Artix-7 FPGA, network ports, and a flash array (on both sides) organized into four buses (Bus 0~3).]
BlueDBM: Platform with near-storage processing and inter-controller networks
- 20 24-core Xeon servers
- 20 BlueDBM storage devices, each with:
  - 1 TB flash storage
  - 4x 20 Gbps controller network
  - Xilinx VC707
  - 2 GB/s PCIe
BlueDBM: Platform with near-storage processing and inter-controller networks
[Photos: one of the two racks (10 nodes) and a BlueDBM storage device; same configuration as the previous slide.]
BlueDBM node architecture
- Lightweight flash management with very low overhead: the flash controller adds almost no latency; ECC support
- Custom network protocol with low latency and high bandwidth: 4x 20 Gbps links at 0.5 μs latency; virtual channels with flow control
- Software has very low-level access to flash storage: high-level information can be used for low-level management; the FTL is implemented inside the file system
(No time to go into gritty details!)
[Diagram: flash device with in-storage processor, network interface, and flash controller, attached to the host server over PCIe.]
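The "virtual channels with flow control" bullet refers to the standard credit-based scheme: a sender may only push a flit when it holds a buffer credit for that channel, and the receiver returns credits as it drains its buffer. The real logic lives in the FPGA and the slide gives no implementation detail, so the Python below is only a toy model of the technique, with the class name and credit count invented for illustration.

```python
# Toy model of credit-based flow control on a virtual channel. The
# actual BlueDBM protocol is implemented in hardware; this sketch only
# illustrates the mechanism, and the credit count is arbitrary.

from collections import deque

class VirtualChannel:
    def __init__(self, credits=4):
        self.credits = credits      # receiver buffer slots still available
        self.waiting = deque()      # flits blocked until a credit frees up

    def send(self, flit):
        """Try to put a flit on the wire; queue it if out of credits."""
        self.waiting.append(flit)
        return self._drain()

    def credit_return(self, n=1):
        """Receiver drained n flits from its buffer; resume sending."""
        self.credits += n
        return self._drain()

    def _drain(self):
        on_wire = []
        while self.waiting and self.credits > 0:
            self.credits -= 1
            on_wire.append(self.waiting.popleft())
        return on_wire

# Each virtual channel blocks independently, so a stalled channel
# cannot back up traffic on the other channels sharing the link.
ch = VirtualChannel(credits=2)
assert ch.send("flit-0") == ["flit-0"]
assert ch.send("flit-1") == ["flit-1"]
assert ch.send("flit-2") == []           # out of credits: queued
assert ch.credit_return() == ["flit-2"]  # credit arrives: flit released
```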
BlueDBM software view
[Stack diagram. Userspace: hardware-assisted applications. Kernelspace: file system, block device driver, and a Connectal proxy (generated by Connectal*, by Quanta). FPGA: flash controller, HW accelerator, Connectal wrapper, accelerator manager, network interface, and NAND flash.]
BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal)
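In practice the two interfaces coexist: unmodified applications read the flash through ordinary file I/O on the block device, while hardware-assisted applications call into the FPGA through the Connectal-generated proxy. The sketch below outlines both paths; the mount point and the proxy class are hypothetical placeholders, not the actual Connectal API.

```python
# Sketch of the two access paths. The mount point and AcceleratorProxy
# below are hypothetical stand-ins, not the real Connectal-generated
# interface.

import os

DATASET = "/mnt/bluedbm/dataset.bin"   # hypothetical mount point

def read_generic(path, nbytes=8192):
    """Generic path: plain file I/O through the block device driver."""
    with open(path, "rb") as f:
        return f.read(nbytes)

class AcceleratorProxy:
    """Hypothetical stand-in for a Connectal-generated proxy stub."""
    def start_query(self, offset, length):
        # In the real system this would cross PCIe into the
        # accelerator manager on the FPGA.
        raise NotImplementedError("placeholder for a Connectal RPC")

if os.path.exists(DATASET):            # only on an actual BlueDBM node
    first_block = read_generic(DATASET)
```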
Power consumption is low
Storage device:
  Component            Power (Watts)
  VC707                30
  Flash board (x2)     10
  Storage device total 40
Node:
  Component            Power (Watts)
  Storage device       40
  Xeon server          200+
  Node total           240+
The storage device power consumption is a very conservative estimate. A GPU-based accelerator would double the power.
Applications
- Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: dedicated network and accelerated caching systems with larger capacity
- Graph analytics: benefits of lower-latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission
Content-based image retrieval
Takes a query image and returns similar images in a dataset of tens of millions of pictures. Image similarity is determined by measuring the distance between the histograms of each image:
- Histograms are generated using RGB, HSV, "edgeness", etc.
- Better algorithms are available!
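To make the similarity measure concrete, here is a minimal sketch using only per-channel RGB histograms and L1 distance. The slide does not fix the exact distance metric, and the real pipeline also folds in HSV and edgeness features, so the choices below are illustrative assumptions.

```python
# Minimal histogram-based similarity sketch: RGB histograms plus L1
# distance. The actual system also uses HSV and "edgeness" features,
# and the slide does not specify the distance metric, so both choices
# here are assumptions.

import numpy as np

def rgb_histogram(img, bins=32):
    """img: HxWx3 uint8 array -> normalized vector of 3*bins entries."""
    per_channel = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
                   for c in range(3)]
    h = np.concatenate(per_channel).astype(np.float64)
    return h / h.sum()

def distance(h1, h2):
    return float(np.abs(h1 - h2).sum())    # L1 distance

# Rank a (random, stand-in) dataset by similarity to a query image:
rng = np.random.default_rng(0)
query = rgb_histogram(rng.integers(0, 256, (64, 64, 3), dtype=np.uint8))
dataset = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
ranked = sorted(range(len(dataset)),
                key=lambda i: distance(query, rgb_histogram(dataset[i])))
```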
Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Pipeline diagram: flash -> flash controller -> Sobel filter -> histogram generator -> histogram comparator (fed the query histogram) -> software, all inside the FPGA.]
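The Sobel stage in the pipeline computes a gradient magnitude per pixel, which feeds the edgeness histogram. Below is a software rendition of just that arithmetic; in the accelerator the pixels stream directly out of flash, and the explicit loop convolution here is chosen for readability, not speed.

```python
# Software rendition of the Sobel stage: per-pixel gradient magnitude
# on a grayscale image. In the FPGA pipeline the pixels stream from
# flash; only the arithmetic is shown here.

import numpy as np

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=np.float64)
KY = KX.T                               # vertical-gradient kernel

def sobel_magnitude(gray):
    """gray: HxW float array -> (H-2)x(W-2) gradient magnitudes."""
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):                  # explicit 3x3 correlation
        for j in range(3):
            patch = gray[i:i + h - 2, j:j + w - 2]
            gx += KX[i, j] * patch
            gy += KY[i, j] * patch
    return np.hypot(gx, gy)             # "edgeness" of each pixel

edges = sobel_magnitude(np.random.rand(64, 64))
```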
Image query performance without sampling
[Chart: query throughput of BlueDBM + FPGA, BlueDBM + CPU (CPU-bottlenecked), and an off-the-shelf M.2 SSD.]
Faster flash with acceleration can perform at DRAM speed
Sampling to improve performance
Intelligent sampling methods (e.g., Locality-Sensitive Hashing) improve performance by dramatically reducing the search space, but introduce a random access pattern: the data accesses corresponding to a single hash-table entry result in many random accesses (see the sketch below).
[Diagram: a locality-sensitive hash table whose bucket entries point to scattered locations in the dataset.]
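The slide names LSH but not a hash family; the sketch below uses random-hyperplane hashing, a common choice for vector data. Nearby histograms tend to share a signature, so a query scans only one bucket, but that bucket's entries point at addresses scattered across flash, which is exactly the random-access pattern the slide warns about.

```python
# Random-hyperplane LSH sketch (the hash family is an assumption; the
# slide names LSH but no specific scheme). A query scans only its own
# bucket, but the bucket holds scattered "flash addresses".

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, NBITS = 96, 12                     # histogram length, signature bits
planes = rng.standard_normal((NBITS, DIM))

def signature(vec):
    """One bit per hyperplane: which side of it the vector falls on."""
    return sum(1 << i for i, side in enumerate(planes @ vec > 0) if side)

# Index: bucket -> list of "flash addresses" (here, row indices).
vectors = rng.standard_normal((10000, DIM))
table = defaultdict(list)
for addr, v in enumerate(vectors):
    table[signature(v)].append(addr)

def query(qvec):
    candidates = table[signature(qvec)]  # tiny fraction of the dataset...
    # ...but reading them back is a burst of random flash accesses:
    return min(candidates, default=None,
               key=lambda a: float(np.abs(vectors[a] - qvec).sum()))

nearest = query(rng.standard_normal(DIM))
```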
Image query performance with sampling
A disk-based system cannot take advantage of the reduced search space
memcached service
A distributed in-memory key-value store:
- Caches DB results indexed by query strings
- Accessed via socket communication
- Uses system DRAM for caching (~256 GB)
Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, ...
[Diagram: browser/mobile apps send web requests to application servers, which issue memcached requests to the memcached servers and return the cached data in the response.]
Networking contributes ~90% of the overhead
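The diagram describes the classic look-aside pattern: the application server tries the cache first and only falls back to the database on a miss, populating the cache on the way back. The sketch below captures that control flow; the dict stands in for the memcached cluster, and query_database is a hypothetical placeholder for the slow backend.

```python
# Look-aside caching flow from the diagram above. A plain dict stands
# in for the memcached servers, and query_database is a hypothetical
# placeholder for the (much slower) backing database.

cache = {}

def query_database(key):
    return "row-for-" + key             # hypothetical slow DB lookup

def get(key):
    value = cache.get(key)              # memcached GET
    if value is None:                   # miss: go to the database...
        value = query_database(key)
        cache[key] = value              # ...and SET so the next GET hits
    return value

assert get("user:42") == get("user:42")  # second call served from cache
```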
Bluecache: Accelerated memcached service (Shuotao Xu)
[Diagram: web servers attach over PCIe to Bluecache accelerators; each accelerator has a flash controller, 1 TB of flash, and network ports on the inter-controller network.]
Memcached server implemented in hardware:
- Hashing and flash management implemented in the FPGA
- 1 TB hardware-managed flash cache per node
- Hardware server accessed via local PCIe
- Direct network between hardware nodes
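One way to picture "hashing and flash management in the FPGA": the key hashes to a bucket whose entry records where the value lives in flash, so a hit costs a single flash read. The bucket count, collision policy, and layout below are illustrative guesses, not the actual Bluecache design.

```python
# Toy model of a flash-backed hash table: key -> bucket -> flash
# address -> value. Bucket count, eviction-on-collision, and layout
# are illustrative guesses, not the real Bluecache design.

import hashlib

NBUCKETS = 1 << 16
buckets = [None] * NBUCKETS             # bucket -> (key, flash_address)
flash = {}                              # stands in for the 1 TB flash array

def bucket_of(key: bytes) -> int:
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NBUCKETS

def put(key: bytes, value: bytes, flash_address: int):
    flash[flash_address] = value
    buckets[bucket_of(key)] = (key, flash_address)  # collision: evict

def get(key: bytes):
    entry = buckets[bucket_of(key)]
    if entry is not None and entry[0] == key:
        return flash[entry[1]]          # a hit costs one flash read
    return None                         # miss

put(b"user:42", b"profile-bytes", flash_address=0x1000)
assert get(b"user:42") == b"profile-bytes"
```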
Effect of architecture modification (no flash, only DRAM)
[Chart: GET throughput in KOps/s, key size = 64 bytes, value size = 64 bytes: Bluecache 4012, local memcached 357, remote memcached 273 (an 11x improvement).]
PCIe DMA and the inter-controller network reduce access overhead; FPGA acceleration of memcached is effective
High cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
[Chart: throughput in KOps/s vs. cache-miss rate (0~50%); key size = 64 bytes, value size = 8 KB; 5 ms penalty per cache miss; no cache misses assumed for Bluecache.]
Bluecache starts performing better at a 5% miss rate: a "sweet spot" for large flash caches exists (see the model below)
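The crossover can be reproduced with a one-line model: effective time per GET is the hit cost plus miss_rate x 5 ms. The hit costs below are illustrative stand-ins, chosen only so DRAM hits are much cheaper than flash hits, and the slide's assumption of a miss-free flash cache is kept; with these numbers the DRAM cache falls behind at a few percent miss rate.

```python
# Back-of-the-envelope model of the crossover: effective GET time is
# hit cost + miss_rate * 5 ms. The hit costs are illustrative
# stand-ins; the 5 ms miss penalty and the "no misses for the large
# flash cache" assumption come from the slide.

MISS_PENALTY = 5e-3                     # seconds per cache miss
DRAM_HIT = 20e-6                        # illustrative DRAM-cache hit cost
FLASH_HIT = 200e-6                      # illustrative flash-cache hit cost

def throughput(hit_cost, miss_rate):
    return 1.0 / (hit_cost + miss_rate * MISS_PENALTY)   # ops/s

flash_tput = throughput(FLASH_HIT, 0.0)          # assumed miss-free
for miss in (0.00, 0.01, 0.03, 0.05, 0.10):
    dram_tput = throughput(DRAM_HIT, miss)
    winner = "flash" if flash_tput > dram_tput else "DRAM"
    print("miss %4.1f%%: DRAM %8.0f ops/s vs flash %8.0f ops/s -> %s"
          % (100 * miss, dram_tput, flash_tput, winner))
```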
Graph traversal
A very latency-bound problem, because the next node to visit often cannot be predicted, so it is beneficial to reduce latency by moving computation closer to the data (see the sketch below).
[Diagram: Flash 1/2/3 with in-store processors, connected to Hosts 1/2/3.]
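Why latency-bound: each hop's target is only known after the previous read returns, so the reads serialize and traversal time is roughly the number of reads times the per-access latency. The toy breadth-first traversal below makes that dependence explicit; the graph and the 100 μs figure are illustrative, and fetching a node's edge list stands in for a read from distributed flash.

```python
# Toy breadth-first traversal showing why the problem is latency-bound:
# every node's edge list must be read before its neighbors are known,
# so reads serialize. The graph and the 100 us per-read figure are
# illustrative only.

from collections import deque

GRAPH = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
ACCESS_LATENCY_US = 100                 # per-read latency into flash

def traverse(start):
    seen, frontier, reads = {start}, deque([start]), 0
    while frontier:
        node = frontier.popleft()
        reads += 1                      # dependent read: fetch edge list
        for nxt in GRAPH[node]:         # stands in for a flash read
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return reads * ACCESS_LATENCY_US    # serialized cost, microseconds

print("traversal cost ~ %d us" % traverse(0))   # 6 reads -> ~600 us
```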
Graph traversal performance
[Chart: nodes traversed per second for Software + DRAM, Software + separate network, Software + controller network, and Accelerator + controller network.]
* The fast BlueDBM network was used even in the "separate network" configuration, for fairness.
A flash-based system can achieve comparable performance with a much smaller cluster
Other potential applications
- Genomics
- Deep machine learning
- Complex graph analytics
- Platform acceleration: Spark, MATLAB, SciDB, ...
Suggestions and collaboration are welcome!
Conclusion
Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on big data. Reducing the access latency of distributed storage requires architectural modifications, including in-storage processors and fast storage networks. Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration.
Thank you
Near-Data Accelerator is Preferable
[Diagram: in the traditional approach the accelerator hangs off the motherboard alongside the NIC, flash, CPU, and DRAM, so hardware and software latencies are additive; in BlueDBM the FPGA accelerator sits directly between the flash and the host.]
[Photo of the assembled node: VC707 board (PCIe, DRAM, Virtex-7, network cable and ports) with the attached flash card (Artix-7, flash).]