OSLab Mining Block Correlations to Improve Storage Performance

ZHENMIN LI, ZHIFENG CHEN and YUANYUAN ZHOU
ACM Transactions on Storage, May 2005

Outline
• INTRODUCTION
• BLOCK CORRELATIONS
• MINING FOR BLOCK CORRELATIONS
• SIMULATION RESULTS
• CONCLUSIONS

INTRODUCTION
• Storage systems see only block read/write operations, with no indication of access patterns or data semantics.
  – They do not know the semantic correlations between blocks.
• Therefore, previous work had to rely on simple patterns to improve performance:
  1. temporal locality
  2. spatial locality (sequential)
  3. loop references
• Correlated blocks tend to be accessed relatively close to each other in an access stream.
  1. Database: index trees to speed up query performance
  2. File system: a file block is correlated to its inode block

INTRODUCTION
• Exploring block correlations is very useful for improving the effectiveness of storage:
  - caching
  - prefetching
  - data layout
  - disk scheduling
• It is difficult and complex for upper levels to inform a storage system about block correlations.
• C-Miner is a method that applies a data mining technique to discover block correlations in storage systems.
  - the first approach to infer block correlations involving multiple blocks

Block Correlation
What are block correlations?
  - direct
  - indirect

Block Correlation
Compared to temporal and spatial locality
Temporal
• Block semantics are more stable than workloads
  ☞ Sequentiality depends on the workload and can change dynamically
  ☞ Especially in systems that lack bursty deletion and insertion operations
• I/O requests may not always be consecutive
  ☞ Interleaving of requests and transactions
Spatial
• Many correlations are more complex than spatial locality
  - a file's blocks: allocated consecutively
  - inode block: not allocated contiguously with its file blocks
  - directory block: allocated separately from the inode blocks

Block Correlation
A block correlation may involve more than two blocks
Three-block correlation
  (a, b) → c : afbrcf    99%
  (a) → c    : aefxec    30%
  (b) → c    : efbsxsc   10%
Real instance: B+ tree, sequential scan of the leaf nodes

Block Correlation
Exploiting Block Correlations
Prefetching
  If a strong correlation exists between two blocks, they can be fetched together from disk whenever one of them is accessed.
Data layout on disk
  A block can be collocated with its correlated blocks so that they can be fetched together using just one disk access.
Caching
  A storage cache can “promote” or “demote” a block after its correlated block is accessed or evicted.
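The prefetching idea can be sketched as a rule-table lookup. This is only an illustration: the function name `prefetch_candidates`, the shape of the `rules` table (block → list of (correlated block, confidence) pairs), and the set-based cache are assumptions made for the example, not the authors' implementation.

```python
def prefetch_candidates(accessed, rules, cache):
    """Return blocks correlated with `accessed` that are not cached yet.

    rules: hypothetical table mapping a block to (correlated_block, confidence)
    pairs; the strongest correlation is returned first.
    """
    ranked = sorted(rules.get(accessed, []), key=lambda r: r[1], reverse=True)
    return [block for block, _conf in ranked if block not in cache]


# Hypothetical rule table: after block "a", blocks "b" and "c" tend to follow.
rules = {"a": [("b", 0.8), ("c", 1.0)]}
to_fetch = prefetch_candidates("a", rules, cache={"b"})
```

A real storage cache would also need a policy for how many correlated blocks to fetch per access; that choice is left out here.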

Frequent Sequence Mining
CloSpan (Closed Sequential Pattern mining)
• The main idea of CloSpan is to find only closed frequent subsequences
• A closed sequence is a subsequence whose support is different from that of all its supersequences
• Example: D = { abced, abcef, agbch, abijc, aklc }
  - frequent subsequences (min_sup = 4): { a : 5, b : 4, c : 5, ab : 4, ac : 5, bc : 4, abc : 4 }
  - closed subsequences: { ac : 5, abc : 4 }
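The frequent set in the example can be checked with a brute-force sketch. It enumerates every subsequence of every sequence and counts support directly; this is not how CloSpan works internally, only a way to reproduce the numbers on the slide. The helper names are chosen for this example.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


def frequent_subsequences(database, min_sup):
    """Brute force: try every subsequence of every sequence, keep the frequent ones."""
    candidates = set()
    for s in database:
        for k in range(1, len(s) + 1):
            for idx in combinations(range(len(s)), k):
                candidates.add("".join(s[i] for i in idx))
    result = {}
    for p in candidates:
        sup = support(p, database)
        if sup >= min_sup:
            result[p] = sup
    return result


D = ["abced", "abcef", "agbch", "abijc", "aklc"]
frequent = frequent_subsequences(D, min_sup=4)
```

Brute force is exponential in the sequence length, which is why a real miner grows patterns incrementally instead.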

C-Miner Basic Mining Algorithm
• A frequent subsequence indicates that the involved blocks are frequently accessed together.
• However, the original mining algorithm does not consider the gap of a frequent subsequence, i.e. how far apart its items lie in the access stream.
  xhgurgfercreufhchufsehfaewy
• To address this issue, C-Miner specifies a maximum distance threshold, denoted max_gap.
  ☞ All the uninteresting frequent sequences whose gaps are larger than the threshold are filtered out.
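A gap constraint of this kind can be sketched as follows. The slide does not define how the gap is measured, so this example assumes it is the position difference between consecutive matched items; `occurs_within_gap` is a name invented for the illustration.

```python
def occurs_within_gap(pattern, sequence, max_gap):
    """True if the non-empty pattern occurs in sequence with every pair of
    consecutive matched items at most max_gap positions apart (assumed definition)."""
    # Positions where the prefix matched so far may end.
    ends = [i for i, x in enumerate(sequence) if x == pattern[0]]
    for item in pattern[1:]:
        ends = [j for j, x in enumerate(sequence)
                if x == item and any(0 < j - i <= max_gap for i in ends)]
        if not ends:
            return False
    return bool(ends)


close_enough = occurs_within_gap("abc", "abijc", max_gap=3)   # b -> c is 3 apart
too_far = occurs_within_gap("abc", "abijc", max_gap=2)
```

Tracking all possible end positions, rather than matching greedily, avoids missing an occurrence whose first match happens to be followed by a long gap.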

C-Miner Preprocessing
• CloSpan is designed to discover patterns in a sequence database rather than in a single long time-series sequence, so the access stream must first be cut into short sequences.
Overlapped cutting
  ☞ Generates more sequences than nonoverlapped cutting
Nonoverlapped cutting
  ☞ Can lose frequent subsequences that are split across two or more sequences (so the cutting window size is set large)
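The two cutting schemes can be sketched as below. The window and step sizes, the function names, and the sample stream are all made up for the illustration; the slide gives no concrete values.

```python
def cut_nonoverlapped(stream, window):
    """Split the stream into consecutive windows that share no items."""
    return [stream[i:i + window] for i in range(0, len(stream), window)]


def cut_overlapped(stream, window, step):
    """Slide a window forward by `step` (< window) so neighbouring windows overlap."""
    pieces = []
    for start in range(0, len(stream), step):
        pieces.append(stream[start:start + window])
        if start + window >= len(stream):   # this window already reaches the end
            break
    return pieces


stream = "pqabcrsabct"   # hypothetical access stream; "abc" occurs twice
plain = cut_nonoverlapped(stream, window=4)
overlapped = cut_overlapped(stream, window=4, step=2)
```

In this stream both occurrences of "abc" straddle a nonoverlapped window boundary, so neither is visible to the miner, while overlapped cutting recovers them at the cost of more sequences to mine.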

C-Miner Core Algorithm - Stage 1
• Generating a candidate set of frequent subsequences
Optimization
1) If a sequence is frequent, all of its subsequences are frequent
   L1 = {a, b, c}; L2 = {ab, ac, bc}
   L3' = L2 × L1 = {aba, abb, abc, aca, acb, acc, bca, bcb, bcc}
2) Each sequence in L2 is concatenated with only the frequent items in its suffix database
   D_ab = {ced, cef, ch, ijc}, so add abc to L3'
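The suffix-database step can be sketched as follows, using the database D = { abced, abcef, agbch, abijc, aklc } and min_sup = 4 from the CloSpan example earlier in the deck. The function names are chosen for this illustration, and the suffix is taken after the earliest match of the prefix, which is an assumption.

```python
from collections import Counter


def suffix_database(prefix, database):
    """For each sequence containing prefix, keep what follows its earliest match."""
    suffixes = []
    for seq in database:
        pos = 0
        for item in prefix:
            pos = seq.find(item, pos)
            if pos < 0:
                break               # prefix does not occur in this sequence
            pos += 1
        else:
            suffixes.append(seq[pos:])
    return suffixes


def extend(prefix, database, min_sup):
    """Append to prefix only the items that are frequent in its suffix database."""
    counts = Counter()
    for suffix in suffix_database(prefix, database):
        counts.update(set(suffix))  # count each item once per suffix
    return sorted(prefix + item for item, n in counts.items() if n >= min_sup)


D = ["abced", "abcef", "agbch", "abijc", "aklc"]
d_ab = suffix_database("ab", D)
candidates = extend("ab", D, min_sup=4)
```

Only one candidate is generated from ab here, instead of the three (aba, abb, abc) that the plain L2 × L1 product would produce.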

C-Miner Core Algorithm - Stage 2
• Pruning the non-closed subsequences from the candidate set
Optimization
1) Check whether each candidate is frequent by searching its suffix database Ds
   L3' = {aba, abb, abc, aca, acb, acc, bca, bcb, bcc}
   A candidate s is kept in L3 only if its support in Ds reaches min_sup
2) Determine whether there are new closed patterns in a search subspace, and stop checking unpromising subspaces
   e.g. if b never starts a pattern on its own (bxxx: 0%) but always follows a (axxbxxx: 100%), patterns beginning with b are not closed
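The pruning of non-closed subsequences can be sketched directly from the definition. The input below is the frequent set from the CloSpan example earlier in the deck. The pairwise comparison is a naive check for illustration and does not include the search-space optimizations listed above.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def closed_only(frequent):
    """Drop every pattern that has a supersequence with the same support."""
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(
            q != p and sup_q == sup and is_subsequence(p, q)
            for q, sup_q in frequent.items()
        )
        if not absorbed:
            closed[p] = sup
    return closed


frequent = {"a": 5, "b": 4, "c": 5, "ab": 4, "ac": 5, "bc": 4, "abc": 4}
closed = closed_only(frequent)
```

Comparing only against other frequent patterns is sufficient, because a supersequence with the same support as a frequent pattern is itself frequent.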

C-Miner Generating Association Rules
• Break the frequent sequences into association rules
  The rules of abc: { a → b, a → c, b → c, ab → c }
• Constrain the length of a rule to limit the number of rules
Confidence of Rules
• For each association rule, we need to evaluate its accuracy.
  { a : 5, ab : 4 } → for a → b, confidence = 4/5 = 80%
• Rules with low confidence (below min_conf) are filtered out.
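Rule generation and confidence can be sketched as below. The enumeration (every non-empty subsequence of the items before the right-hand side, up to a length limit) is an assumption chosen because it reproduces the four rules listed for abc. The names `rules_from` and `max_lhs` are invented for the example, and the support values come from the CloSpan example earlier in the deck.

```python
from itertools import combinations


def rules_from(pattern, support, max_lhs=2, min_conf=0.0):
    """Break a frequent sequence into rules lhs -> rhs with their confidence.

    confidence(lhs -> rhs) = support(lhs + rhs) / support(lhs)
    """
    rules = {}
    for j in range(1, len(pattern)):
        rhs = pattern[j]
        for k in range(1, min(j, max_lhs) + 1):
            for idx in combinations(range(j), k):
                lhs = "".join(pattern[i] for i in idx)
                conf = support[lhs + rhs] / support[lhs]
                if conf >= min_conf:
                    rules[(lhs, rhs)] = conf
    return rules


support = {"a": 5, "b": 4, "c": 5, "ab": 4, "ac": 5, "bc": 4, "abc": 4}
all_rules = rules_from("abc", support)
strong_rules = rules_from("abc", support, min_conf=0.9)
```

With min_conf = 0.9 the rule a → b (80%) is dropped, while a → c, b → c and ab → c remain.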

C-Miner* Mining from a Single Long Sequence
• When C-Miner* counts the support for a subsequence beginning with item α, it scans only the lookahead windows that begin with the same item α.
• C-Miner* has better accuracy and efficiency than both the nonoverlapped and overlapped cutting methods.
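Support counting over lookahead windows might look like the sketch below. The slide does not say how a window is sized or how occurrences are counted, so this example assumes a fixed window length and counts one occurrence per window that starts with the first item of the pattern. The function name and sample stream are made up.

```python
def support_in_lookahead_windows(pattern, stream, window):
    """Count lookahead windows that begin with pattern[0] and contain the
    rest of the pattern in order (assumed counting rule)."""
    count = 0
    for start, item in enumerate(stream):
        if item != pattern[0]:
            continue                 # only windows beginning with the same item
        rest = iter(stream[start + 1:start + window])
        if all(x in rest for x in pattern[1:]):
            count += 1
    return count


stream = "pqabcrsabct"   # hypothetical access stream; "abc" occurs twice
sup_abc = support_in_lookahead_windows("abc", stream, window=4)
```

Because every window is anchored at an occurrence of the first item, no occurrence is lost to a cut boundary and no window is scanned for patterns it cannot start.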

SIMULATION RESULTS
Evaluation Methodology
Simulator
  - DiskSim
  - CacheSim – LRU
  - 10,000-rev/min IBM Ultrastar 36Z15
Traces
  - Cello-92, Cello-96 and Cello-99: workstation at HP Labs; sequential access patterns
  - TPC-C: Microsoft SQL Server via SAN; TPC-C benchmark
  - OLTP: a large financial institution
  - TPC-H: Microsoft SQL Server via SAN; sequential access patterns
Method
  - First half of the trace: mine block correlations
  - The rest: evaluate the performance of
    - correlation-directed prefetching
    - data layout

SIMULATION RESULTS
Data Mining Overhead
[Figure: data mining overhead, confidence ≥ 10%]

SIMULATION RESULTS
Correlation-Directed Prefetching and Disk Layout
CDP = correlation-directed prefetching

CONCLUSIONS
• C-Miner is a novel algorithm that uses data mining techniques to mine access sequences to infer block correlations.
• To increase the accuracy and time efficiency of C-Miner, the authors further propose a new algorithm, called C-Miner*
  – without cutting
• Experimental results have shown that correlation-directed prefetching and data layout can improve average I/O response time by:
  - 12–30% compared to no prefetching
  - 7–25% compared to sequential prefetching