SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems
Daehee Kim, Sejun Song, Baek-Young Choi
University of Missouri-Kansas City

Cloud Storage (Dropbox, Google Drive, …)
- Anywhere, anytime access, i.e. remote backup
- Server: large storage consumption
- Network: high network bandwidth consumption
- Client: high uploading overhead
- Users: individuals and employees (e.g. sales, marketing)

Data deduplication
- Deduplication granularity
  - File-level
  - Sub-file-level
    - Fixed-size chunk
    - Variable-size chunk
- Deduplication location
  - Server-based
    - Traditionally on the high-capacity servers
  - Client-based
    - Limited by the client capacity

File-Level (File-Level Deduplication)
[Figure: control and data paths between the index table and storage; a file with a unique index is stored, a file with a duplicate index is not (X)]
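A minimal sketch of file-level deduplication, assuming the index is a SHA-256 digest of the whole file and the index table is an in-memory dictionary. The function name `file_level_dedup` and the sample files are hypothetical and only illustrate the lookup shown in the figure.

```python
import hashlib

def file_level_dedup(files):
    """Store each distinct file content once, keyed by its hash index."""
    index_table = {}   # index (SHA-256 hex digest) -> stored file content
    stored_names = []  # names of the files that actually had to be stored
    for name, data in files:
        index = hashlib.sha256(data).hexdigest()
        if index in index_table:
            continue  # duplicate index: the data is not stored again
        index_table[index] = data
        stored_names.append(name)
    return index_table, stored_names

# Made-up input: the third file has the same content as the first one.
files = [
    ("a.txt", b"nice people, good papers"),
    ("b.txt", b"good conference"),
    ("copy_of_a.txt", b"nice people, good papers"),
]
index_table, stored_names = file_level_dedup(files)
print(stored_names)  # ['a.txt', 'b.txt']
```

One hash per file keeps the index table small, but a single changed byte makes the whole file look unique.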

Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication)
- e.g. granularity: 15-byte fixed size
- File 1: nice people, good papers, and good conference, ……
  - Chunks: nice people, go | od papers, and | good conference | ……
- File 2: welcome, nice people, good papers, and good conference, ……
  - Chunks: welcome, nice p | eople, good pap | ers, and good c | ……
- Offset shifting problem: no redundancies found
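The offset shifting problem can be reproduced in a few lines. This is a sketch that uses the slide's example strings (without the trailing ", ……") and a hypothetical helper `fixed_chunks`.

```python
def fixed_chunks(text, size=15):
    """Cut text into consecutive chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

file1 = "nice people, good papers, and good conference"
file2 = "welcome, " + file1  # same content, shifted by a 9-character prefix

chunks1 = fixed_chunks(file1)
chunks2 = fixed_chunks(file2)
shared = set(chunks1) & set(chunks2)

print(chunks1)  # ['nice people, go', 'od papers, and ', 'good conference']
print(chunks2)  # ['welcome, nice p', 'eople, good pap', 'ers, and good c', 'onference']
print(shared)   # set()
```

Because every boundary in File 2 moves by the length of the prefix, no chunk of File 2 matches a chunk of File 1 even though almost all of the content is shared.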

Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication)
- Boundaries are based on content, not on a fixed offset
- e.g. matching pattern: “go”
- File 1: nice people, good papers, and good conference, ……
  - Chunks: nice people, go | od papers, and go | ……
- File 2: welcome, nice people, good papers, and good conference, ……
  - Chunks: welcome, nice people, go | od papers, and go | ……
- The chunk "od papers, and go" is identical in File 1 and File 2
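A sketch of the content-based boundaries from this example, assuming a chunk ends right after each occurrence of the literal pattern "go". The helper `pattern_chunks` is hypothetical; practical systems usually derive boundaries from a rolling hash over the data rather than from a literal string.

```python
def pattern_chunks(text, pattern="go"):
    """Cut text right after every occurrence of `pattern`."""
    chunks, start = [], 0
    while True:
        hit = text.find(pattern, start)
        if hit == -1:
            break
        end = hit + len(pattern)
        chunks.append(text[start:end])
        start = end
    if start < len(text):
        chunks.append(text[start:])  # remainder after the last boundary
    return chunks

file1 = "nice people, good papers, and good conference"
file2 = "welcome, " + file1

chunks1 = pattern_chunks(file1)
chunks2 = pattern_chunks(file2)
shared = sorted(set(chunks1) & set(chunks2))

print(chunks1)  # ['nice people, go', 'od papers, and go', 'od conference']
print(chunks2)  # ['welcome, nice people, go', 'od papers, and go', 'od conference']
print(shared)   # ['od conference', 'od papers, and go']
```

Only the first chunk absorbs the inserted prefix; every later boundary falls at the same content position in both files, so the remaining chunks deduplicate.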

Deduplication: Comparisons
- Deduplication ratio: File-level < Fixed size << Variable size
- Processing time: File-level < Fixed size <<<< Variable size
- Index overhead: File-level << Fixed size, Variable size
- File-level side: good for client-based; variable-size side: good for server-based
- Current cloud storage systems: client-based
  - JustCloud, Mozy: file deduplication
  - Dropbox: large fixed-size block deduplication (4 MB)

Objective
- Develop an efficient client deduplication that achieves
  - High deduplication ratio
  - Low network traffic
  - Low processing time
  - Less index overhead

Outline
- Motivation, Background, and Goal
- Observations and Approach
- Design
- Evaluation
- Conclusion

Observations
- A structured file can be decomposed into various objects
  - Fast decomposition without the offset shifting problem
  - e.g. compressed files (zip, rar, ..), document files (pdf, doc, ppt, docx, pptx), emails
- [Example]
  - email: meta, body (text), attachments (text, pdf, docx, images)
  - pdf:
    - Page object: <</Type/Page/ …>>
    - Image object: <</Type/..Image/..Filter/..Length>> <stream>Encoded image<endstream>
    - Text object: <</Filter/../Length >> <stream>Encoded text<endstream>
    - …

Observations
- A large number of structured files exist in cloud-based storage systems
[Figure: dataset]

Our Approach (SAFE)
- Apply file-level deduplication for redundant files
  - Speed-up and small index sizes
- Apply object-based deduplication for structured files
  - Decompose a file into objects
  - Find redundancies based on decomposed objects
  - Combine small-sized metadata into an object (to reduce index sizes)
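A sketch of object-based deduplication under stated assumptions: an in-memory zip archive stands in for a structured file (docx and pptx are zip containers), each archive member is treated as one object, and SHA-256 digests serve as object indexes. The helper names and sample contents are hypothetical, and the step that combines small metadata into one object is left out.

```python
import hashlib
import io
import zipfile

def make_container(members):
    """Build an in-memory zip archive standing in for a structured file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()

def decompose(container_bytes):
    """Yield the content of each member (object) of the container."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        for name in zf.namelist():
            yield zf.read(name)

def object_level_dedup(containers):
    """Return the unique-object store and the total number of objects seen."""
    store = {}  # index (SHA-256 hex digest) -> object content
    total = 0
    for container in containers:
        for obj in decompose(container):
            total += 1
            store.setdefault(hashlib.sha256(obj).hexdigest(), obj)
    return store, total

# Two made-up documents that differ in text but embed the same image.
logo = b"\x89PNG fake image bytes" * 100
doc_a = make_container({"text.xml": b"<p>draft one</p>", "media/logo.png": logo})
doc_b = make_container({"text.xml": b"<p>draft two</p>", "media/logo.png": logo})

store, total = object_level_dedup([doc_a, doc_b])
print(total, len(store))  # 4 3
```

File-level deduplication finds nothing here because the two archives differ as byte strings, while the object view stores the shared image only once.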

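SAFE Architecture
[Figure: processing flow]
- Emails: the email parser splits an email into meta and attachments (pdf, img), which are handled as files
- Files: file-level dedup ends processing for a redundant file and passes on a unique file
- Structured?: separates unique files into unstructured files and structured files
- Structured file: the file parser, using the structure library, decomposes it into objects
- Object-level dedup: works on (index, object) pairs; the object manager holds all object indexes
- Store manager: receives the unique object indexes and objects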

SAFE in Cloud Storage
[Figure: client-server exchange]
- Client runs SAFE: file-level dedup, then object-level dedup
- Exchanged between client and server, in order: indexes (objects), indexes (unique objects), unique objects
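One way to read this exchange is sketched below, assuming the client first sends object indexes, the server answers with the indexes it does not yet hold, and the client then uploads only those objects. The class `Server`, its methods, and `client_backup` are hypothetical names for illustration, and the bytes spent on the indexes themselves are not counted.

```python
import hashlib

def index_of(data):
    """Object index: SHA-256 hex digest of the object content."""
    return hashlib.sha256(data).hexdigest()

class Server:
    def __init__(self):
        self.store = {}  # index -> object content

    def unknown_indexes(self, indexes):
        """Return the indexes of objects the server does not hold yet."""
        return [i for i in indexes if i not in self.store]

    def upload(self, objects):
        """Store the uploaded (index -> object) pairs."""
        for index, data in objects.items():
            self.store[index] = data

def client_backup(server, objects):
    """Upload only unique objects; return the number of object bytes sent."""
    by_index = {index_of(o): o for o in objects}
    missing = server.unknown_indexes(list(by_index))
    payload = {i: by_index[i] for i in missing}
    server.upload(payload)
    return sum(len(d) for d in payload.values())

server = Server()
first = client_backup(server, [b"A" * 1000, b"B" * 500])
second = client_backup(server, [b"A" * 1000, b"C" * 200])
print(first, second)  # 1500 200
```

The second backup transfers only the 200 new bytes, which is how client-side deduplication lowers both upload overhead and network traffic.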

Setup
- Collected real data sets
  - Structured files (docx, pptx, and pdf)
  - From file system and emails of five graduate students in the same department
    - File system: 4 GB, emails: 2.5 GB
- Compared deduplications
  - File-level (like JustCloud, Mozy)
  - Fixed block (4 MB, like Dropbox)
  - Variable block (8 KB average chunk size)

Evaluation Metrics
- Performance
  - Deduplication ratio
    - Space savings by removing redundancies
    - ((InputData - ConsumedStorage) / InputData) * 100
  - Network traffic
    - Size of data transferred to a storage over network
    - Byte
- Overhead
  - Processing time
    - Relative processing time to File-level
  - Index size
    - Relative index size per File-level
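The metrics can be written as two small functions. This is a sketch with hypothetical function names and made-up input values, not measurements from the talk.

```python
def dedup_ratio(input_bytes, consumed_bytes):
    """Space savings in percent: ((InputData - ConsumedStorage) / InputData) * 100."""
    return (input_bytes - consumed_bytes) / input_bytes * 100

def relative_to_file_level(value, file_level_value):
    """Processing time or index size expressed as a multiple of the File-level value."""
    return value / file_level_value

# Toy numbers: 4000 bytes of input that occupy 2000 bytes after deduplication.
print(dedup_ratio(4000, 2000))           # 50.0
# Toy numbers: a scheme that takes 3 s where File-level takes 1.5 s.
print(relative_to_file_level(3.0, 1.5))  # 2.0
```

A ratio of 0 means nothing was saved, and values closer to 100 mean more redundancy was removed.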

Deduplication Ratio
- is about 30% to 60% in SAFE
- is 2 times higher in SAFE than in “File-level”
- is as good in SAFE as variable size block deduplication (Block-V) for email datasets
- is even higher in SAFE than Block-V for file system datasets
[Figures: File system datasets, Email datasets; annotations: x 1.5, x 2]

Network Traffic
- is the lowest in SAFE for both datasets
- is 15% and 30% lower in SAFE than file-level deduplication (File) and fixed size block deduplication (Block-F) for both data sets
[Figures: File system datasets (15%), Email datasets (30%)]

Processing Time
- is hundreds of times faster in SAFE than in Block-V
- is as fast in SAFE as in File-level
[Figures: File system datasets (hundreds of times), Email datasets (hundreds of times)]

Index Size
- Is proportional to the number of unique blocks (40 B per index)
  - i.e. for 4000 emails, index sizes are 0.1 MB (file-level) and 1.3 MB (SAFE)
  - SAFE has multiple decomposed objects for a file
- Is 2 to 3 times less in SAFE (1.3 MB) than Block-V (3.7 MB)
  - Block-V has 8 KB block size on average
- Is 2 times more in file system than email datasets
  - i.e. file system dataset has more pdf files (a pdf file can be decomposed into more objects than docx)
[Figures: File system datasets, Email datasets]
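The index sizes follow from simple arithmetic on the 40 B entry size. The sketch below assumes 1 MB = 10^6 bytes and treats the slide's figures as rounded, so the entry counts are approximate; the function names are hypothetical.

```python
INDEX_ENTRY_BYTES = 40  # from the slide: 40 B per index

def index_size_mb(unique_count):
    """Index size in MB (10^6 bytes, an assumption) for a number of unique entries."""
    return unique_count * INDEX_ENTRY_BYTES / 1_000_000

def entries_for(size_mb):
    """Approximate number of index entries implied by a reported index size."""
    return round(size_mb * 1_000_000 / INDEX_ENTRY_BYTES)

# Reported sizes for the 4000-email example.
for label, size_mb in [("file-level", 0.1), ("SAFE", 1.3), ("Block-V", 3.7)]:
    print(label, entries_for(size_mb))
# file-level 2500
# SAFE 32500
# Block-V 92500
```

Under these assumptions SAFE indexes roughly 32,500 objects against roughly 92,500 Block-V chunks, a factor of about 2.8 that is consistent with the "2 to 3 times less" bullet.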

Conclusions
- Developed an efficient structure-aware client-based deduplication (SAFE)
  - High deduplication ratio: as good as Block-V
  - Low network traffic: as good as Block-V
  - Low processing time: hundreds of times faster than Block-V
  - Less index overhead: 2 ~ 3 times less than Block-V
- Future work
  - Extend to incorporate more structured file types

Thank you! Questions?
{daehee.kim, sjsong, choiby}@umkc.edu
- Slides: 24