Split Snapshots and Skippy Indexing: Long Live the Past!
Ross Shaull <rshaull@cs.brandeis.edu>
Liuba Shrira <liuba@cs.brandeis.edu>
Brandeis University

Our Idea of a Snapshot
• A window to the past in a storage system
• Access data as it was at the time the snapshot was requested
• System-wide
• Snapshots may be kept forever
  – i.e., “long-lived” snapshots
• Snapshots are consistent
  – Whatever that means…
• High frequency (up to CDP)

Why Take Snapshots?
• Fix operator errors
• Auditing
  – When did Bob’s salary change, and who made the changes?
• Analysis
  – How much capital was tied up in blue shirts at the beginning of this fiscal year?
• We don’t necessarily know what will be interesting in the future

BITE
• Give the storage system a new capability: Back-in-Time Execution
• Run read-only code against the current state and any snapshot
• After issuing a request for BITE, no special code is required for accessing data in the snapshot
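
To make the interface concrete, here is a minimal, hypothetical Python sketch. The `Store` class and its method names are invented for illustration, and it copies state eagerly; the real system (later slides) copies the past out incrementally.

```python
# Hypothetical BITE-style interface (all names invented for illustration).
class Store:
    def __init__(self):
        self.current = {}     # page id -> contents (current state)
        self.snapshots = []   # snapshot id -> frozen view of the store

    def update(self, page, data):
        self.current[page] = data

    def snapshot(self):
        """Declare a system-wide snapshot; returns its id."""
        self.snapshots.append(dict(self.current))  # eager copy, for brevity
        return len(self.snapshots) - 1

    def bite(self, snap_id, read_only_fn):
        """Back-in-Time Execution: run unmodified read-only code against
        the state as of snapshot `snap_id`."""
        return read_only_fn(self.snapshots[snap_id])

store = Store()
store.update("salary:Bob", 100)
s = store.snapshot()
store.update("salary:Bob", 120)
# The same read-only code runs against the past and the present:
lookup = lambda state: state["salary:Bob"]
assert store.bite(s, lookup) == 100
assert lookup(store.current) == 120
```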

Other Approaches: Databases
• ImmortalDB, Time-Split B-tree (Lomet)
  – Reorganizes current state
  – Complex
• Snapshot isolation (PostgreSQL, Oracle)
  – Extension to transactions
  – Only for recent past
• Oracle Flashback
  – Page-level copy of recent past (not forever)
  – Interface seems similar to BITE

Other Approaches: FS
• WAFL (Hitz), ext3cow (Peterson)
  – Limited on-disk locality
  – Application-level consistency a challenge
• VSS (Sankaran)
  – Blocks disk requests
  – Suitable for backup-type frequency

A Different Approach
• Goals:
  – Avoid declustering current state
  – Don’t change how current state is accessed
  – Application requests snapshot
  – Snapshots are “on-line” (not in warehouse)
• Split Snapshots
  – Copy past out incrementally
  – Snapshots available through virtualized buffer manager

Our Storage System Model
• A “database”
  – Has transactions
  – Has a recovery log
  – Organizes data in pages on disk

Our Consistency Model
• Crash consistency
  – Imagine that a snapshot is declared, but then, before any modifications can be made, the system crashes
  – After restart, recovery kicks in and the current state is restored to *some* consistent point
  – All snapshots will have this same consistency guarantee after a crash

Our Storage System Model
[Figure: an application asks the access methods, “I want record R.” The search starts at the root; the page table maps pages (P1 → Address X, P2 → Address Y, …, Pn) to locations on disk; pages are fetched through the cache; and R is returned to the application.]

Retaining the Past: COW versus Split-COW

Copy-on-Write (COW) Snapshot
[Figure: operations “Snapshot S,” then “Modify P1.” Taking snapshot S keeps the old page table as the snapshot page table “S”; modifying P1 afterward gives the current page table a fresh copy of P1, while the snapshot page table still points to the old P1 and the unmodified P2.]
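
A minimal sketch of the page-table COW mechanism in the figure, assuming an abstract disk addressed by allocation order; all names are illustrative, not the paper’s implementation.

```python
# Minimal sketch of page-table copy-on-write (COW) snapshots.
class CowStore:
    def __init__(self):
        self.disk = {}          # address -> page contents
        self.page_table = {}    # page id -> address (current state)
        self.snapshots = {}     # snapshot id -> retained page table
        self.copied = set()     # pages rewritten since the last snapshot
        self.next_addr = 0

    def _alloc(self, data):
        addr, self.next_addr = self.next_addr, self.next_addr + 1
        self.disk[addr] = data
        return addr

    def snapshot(self, snap_id):
        # The old page table becomes the snapshot page table; the current
        # state continues with a copy of it.
        self.snapshots[snap_id] = self.page_table
        self.page_table = dict(self.page_table)
        self.copied = set()

    def modify(self, page, data):
        if page in self.copied:
            self.disk[self.page_table[page]] = data   # already diverged: overwrite
        else:
            self.page_table[page] = self._alloc(data) # copy on first write
            self.copied.add(page)

store = CowStore()
store.modify("P1", "old P1")
store.snapshot("S")
store.modify("P1", "new P1")
assert store.disk[store.snapshots["S"]["P1"]] == "old P1"  # snapshot keeps pre-state
assert store.disk[store.page_table["P1"]] == "new P1"      # current state moved on
```

The next slide points at the weakness: once two full page tables exist, bookkeeping for a shared page such as P2 can involve both of them.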

Split-COW
[Figure: with plain COW it is expensive to update P2 in both page tables. Split-COW instead updates the current page table (P1, P2) in place and copies pre-states out into per-snapshot page tables SPT(S), SPT(S+1).]
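
A sketch of the split-COW idea under the same illustrative assumptions as the COW sketch above: the current page table is never duplicated; a page’s pre-state is copied out to a snapshot store on its first write after a snapshot, and recorded in that snapshot’s page table (SPT).

```python
# Sketch of split-COW: current state updated in place; pre-states copied out.
class SplitCowStore:
    def __init__(self):
        self.disk = {}       # current-state pages, updated in place
        self.snapstore = []  # append-only copies of pre-state pages
        self.spt = {}        # snapshot id -> {page id -> snapstore index}
        self.last_snap = None
        self.copied = set()  # pages already copied since the last snapshot

    def snapshot(self, snap_id):
        self.spt[snap_id] = {}
        self.last_snap = snap_id
        self.copied = set()

    def modify(self, page, data):
        if self.last_snap is not None and page in self.disk and page not in self.copied:
            # copy the pre-state out before overwriting the current state
            self.spt[self.last_snap][page] = len(self.snapstore)
            self.snapstore.append(self.disk[page])
            self.copied.add(page)
        self.disk[page] = data    # the current state is updated in place

    def read_snapshot(self, snap_id, page):
        # Simplification: a page missing from SPT(snap_id) is taken to be
        # still current; the real design also consults later snapshots'
        # mappings, which is what the maplog slides below address.
        spt = self.spt[snap_id]
        return self.snapstore[spt[page]] if page in spt else self.disk[page]

store = SplitCowStore()
store.modify("P1", "old P1")
store.modify("P2", "P2")
store.snapshot("S")
store.modify("P1", "new P1")
assert store.read_snapshot("S", "P1") == "old P1"  # copied-out pre-state
assert store.read_snapshot("S", "P2") == "P2"      # never overwritten: still current
```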

What’s next
1. How to manage the metadata?
2. How will snapshot pages be accessed?
3. Can we be non-disruptive?

Metadata Solution
• Metadata (page tables) created incrementally
• Keeping many SPTs is costly
• Instead, write “mappings” into a log
• Materialize SPT on demand

Maplog
• Mappings created incrementally
• Added to append-only log
• Start points to first mapping created after a snapshot is declared
[Figure: a maplog of mappings for pages P1, P2, P3, with Start pointers for Snap 1 through Snap 6 marking where each snapshot’s mappings begin.]
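
A tiny sketch of the structure this slide describes, with invented names: an append-only list of page mappings plus a Start pointer per snapshot.

```python
# Append-only maplog plus per-snapshot Start pointers.
maplog = []   # append-only: (page id, snapstore address of the pre-state)
start = {}    # snapshot id -> index in maplog of its first mapping

def declare_snapshot(snap_id):
    start[snap_id] = len(maplog)    # the next appended mapping belongs to it

def append_mapping(page_id, address):
    maplog.append((page_id, address))

declare_snapshot("Snap1")
append_mapping("P1", 0)     # P1's pre-state copied out after Snap1
append_mapping("P2", 1)
declare_snapshot("Snap2")
append_mapping("P1", 2)     # P1 overwritten again after Snap2
assert start == {"Snap1": 0, "Snap2": 2}
```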

Maplog
• Materialize SPT with a scan
• Scan for SPT(S) begins at Start(S)
• Notice that we read some mappings that we do not need
[Figure: the scan runs from Start(S) toward the end of the maplog; repeated mappings for the same page are read but not needed.]
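
A sketch of that scan, under the same illustrative maplog layout as above. The first-encountered mapping per page is the pre-state that belongs to SPT(S); mappings for already-seen pages are exactly the reads “we do not need,” and a page with no mapping after S is still current.

```python
# Materialize SPT(S) by a forward scan from Start(S).
def materialize_spt(maplog, start_index):
    spt = {}
    for page_id, address in maplog[start_index:]:
        if page_id not in spt:      # keep only the first-encountered mapping
            spt[page_id] = address
    return spt

# Using the small maplog built above (Start(Snap1)=0, Start(Snap2)=2):
maplog = [("P1", 0), ("P2", 1), ("P1", 2)]
assert materialize_spt(maplog, 0) == {"P1": 0, "P2": 1}   # SPT(Snap1)
assert materialize_spt(maplog, 2) == {"P1": 2}            # SPT(Snap2)
```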

Cost of Scanning Maplog
• Let overwrite cycle length L be the number of page updates required to overwrite the entire database
• A maplog scan cannot be longer than the overwrite cycle
• Let N be the number of pages in the database
• For a uniformly random workload, L ≈ N ln N (by the “coupon collector’s waiting time” problem)
• Skew in the update workload lengthens the overwrite cycle
  – A skew of 80/20 (80% of updates go to 20% of the pages) increases L by a factor of 4
• Skew hurts
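
Both claims are easy to sanity-check with a small simulation; this is an illustrative sketch, not the paper’s measurement. With N = 1,000 the uniform cycle comes out near N ln N ≈ 6,900 updates, and the 80/20 cycle is roughly 4x longer.

```python
# Illustrative simulation of the overwrite cycle length L: random updates
# are drawn until every page has been overwritten at least once.
import math
import random

def overwrite_cycle(n_pages, hot_fraction=1.0, hot_weight=1.0):
    """hot_fraction of the pages receive hot_weight of the updates,
    e.g. hot_fraction=0.2, hot_weight=0.8 models an 80/20 skew."""
    hot = int(n_pages * hot_fraction)
    untouched, updates = set(range(n_pages)), 0
    while untouched:
        if random.random() < hot_weight:
            page = random.randrange(hot)              # a hot-page update
        else:
            page = random.randrange(hot, n_pages)     # a cold-page update
        untouched.discard(page)
        updates += 1
    return updates

random.seed(0)
n, runs = 1000, 20
uniform = sum(overwrite_cycle(n) for _ in range(runs)) / runs
skewed = sum(overwrite_cycle(n, 0.2, 0.8) for _ in range(runs)) / runs
print(f"N ln N  = {n * math.log(n):,.0f}")        # ~6,900
print(f"uniform = {uniform:,.0f}")                # close to N ln N
print(f"80/20   = {skewed:,.0f} (~{skewed / uniform:.1f}x longer)")  # ~4x
```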

Skippy Level 1
• Copy the first-encountered mapping (FEM) within each node to the next level
[Figure: the maplog for Snap 1 through Snap 6, holding mappings for P1, P2, P3, is divided into nodes; the FEM copies form Skippy Level 1, with pointers back into the maplog.]
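
A sketch of building one Skippy level, assuming the maplog is divided into fixed-size nodes (the node size and tuple layout here are invented for illustration).

```python
# Build Skippy Level 1: copy each node's first-encountered mappings (FEMs)
# up one level, together with a pointer back to the originating node.
def build_skippy_level(maplog, node_size):
    level = []    # (page id, address, index of the originating node)
    for node_start in range(0, len(maplog), node_size):
        seen = set()
        for page_id, address in maplog[node_start:node_start + node_size]:
            if page_id not in seen:               # first encounter in this node
                seen.add(page_id)
                level.append((page_id, address, node_start))
    return level

maplog = [("P1", 0), ("P2", 1), ("P1", 2), ("P1", 3), ("P3", 4), ("P1", 5)]
# With 3-mapping nodes, the repeated P1 mappings inside each node drop out:
assert build_skippy_level(maplog, 3) == [
    ("P1", 0, 0), ("P2", 1, 0),     # FEMs of the node starting at 0
    ("P1", 3, 3), ("P3", 4, 3),     # FEMs of the node starting at 3
]
```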

Cut redundant mapping count in half
[Figure: materializing SPT(S) for Snap 1 through Snap 6 now starts at Start(S), finishes its maplog node, then follows Skippy Level 1 instead of the raw maplog, skipping the redundant P1/P2 mappings below.]
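
A sketch of how the scan uses Level 1, continuing the illustrative layout above: finish the node containing Start(S) in the raw maplog, then read the much shorter Skippy level instead of the remaining maplog nodes.

```python
# Materialize SPT(S) using one Skippy level. Level-1 entries carry
# (page id, address, originating node index), as built above.
def materialize_spt_skippy(maplog, level1, node_size, start_index):
    spt = {}
    node_end = (start_index // node_size + 1) * node_size
    for page_id, address in maplog[start_index:node_end]:
        spt.setdefault(page_id, address)          # finish the partial node
    for page_id, address, node_start in level1:
        if node_start >= node_end:                # later nodes: FEMs suffice
            spt.setdefault(page_id, address)
    return spt

maplog = [("P1", 0), ("P2", 1), ("P1", 2), ("P1", 3), ("P3", 4), ("P1", 5)]
level1 = [("P1", 0, 0), ("P2", 1, 0), ("P1", 3, 3), ("P3", 4, 3)]
# Start(S) = 1: same SPT as a raw scan, but fewer mappings are read.
assert materialize_spt_skippy(maplog, level1, 3, 1) == {"P2": 1, "P1": 2, "P3": 4}
```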

K-Level Skippy
• Can eliminate the effect of skew, or more
• Enables ad-hoc, on-line access to snapshots, whether they are old or young

Skew    # Skippy Levels    Time to Materialize SPT (s)
50/50   0                  13.8
80/20   0                  19.0
80/20   1                  15.8
80/20   2                  14.7
80/20   3                  13.9
99/1    0                  33.3
99/1    1                  6.69

Accessing Snapshots
• Transparent to layers above the cache
• Indirection layer to redirect page requests from a BITE transaction into the snapstore
[Figure: current-state reads hit P1, P2 in the cache; BITE reads are redirected to the snapshot copies of P1, P2 in the snapstore.]
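
A minimal sketch of the indirection, with invented names: `read()` is the single entry point the layers above the cache already use; only the presence of a snapshot page table changes where the bytes come from.

```python
# Indirection at the cache: BITE reads consult the transaction's SPT and
# are redirected into the snapstore; everything else reads current state.
class BufferManager:
    def __init__(self, current_pages, snapstore):
        self.current = current_pages   # page id -> current contents
        self.snapstore = snapstore     # address -> copied-out pre-state

    def read(self, page_id, spt=None):
        if spt is not None and page_id in spt:
            return self.snapstore[spt[page_id]]   # redirected BITE read
        return self.current[page_id]              # current state

bm = BufferManager(current_pages={"P1": "new P1", "P2": "P2"},
                   snapstore={0: "old P1"})
assert bm.read("P1") == "new P1"                  # ordinary transaction
assert bm.read("P1", spt={"P1": 0}) == "old P1"   # BITE transaction
assert bm.read("P2", spt={"P1": 0}) == "P2"       # unchanged page: still current
```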

Non-Disruptiveness
• Can we create Skippy and COW pre-states without disrupting the current state?
• Key idea:
  – Leverage recovery to defer all snapshot-related writes
  – Write snapshot data in the background to a secondary disk

Implementation
• BDB 4.6.21
• Page cache augmented
  – COWs write-locked pages
  – Trickle COW’d pages out over time
• Leverage recovery
  – Metadata created in memory at transaction commit time, but only written at checkpoint time
  – After a crash, snapshot pages and metadata can be recovered in one log pass
• Costs
  – Snapshot log record
  – Extra memory
  – Longer checkpoints

Early Disruptiveness Results
• Single-threaded updating workload of 100,000 transactions
• 66 MB database
• We can retain a snapshot after every transaction for a 6–8% penalty to writers
• Tests with readers show little impact on sequential scans (not depicted)

Paper Trail
• Upcoming poster and short paper at ICDE ’08
• “Skippy: a New Snapshot Indexing Method for Time Travel in the Storage Manager” to appear in SIGMOD ’08
• Poster and workshop talks
  – NEDBDay ’08, SYSTOR ’08

Questions?

Backups…

Recovery Sketch 1
• Snapshots are crash consistent
• Must recover data and metadata for all snapshots since the last checkpoint
• Pages might have been trickled, so the snapstore must be truncated back to the last mapping before the previous checkpoint
• We require only that a snapshot log record be forced into the log with a group commit; no other data or metadata must be logged until checkpoint

Recovery Sketch 2
• Walk backward through the WAL, applying UNDOs
• When a snapshot record is encountered, copy the “dirty” pages and create a mapping
• The trouble is that snapshots can be concurrent with transactions
• Cope with this by “COWing” a page when an UNDO for a different transaction is applied to that page

The Future
• Sometimes we want to scrub the past
  – Running out of space?
  – Retention windows for SOX compliance
• Change past-state representation
  – Deduplication
  – Compression