Windows Azure Storage A Highly Available Cloud Storage
Windows Azure Storage – A Highly Available Cloud Storage Service with Strong Consistency
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, Leonidas Rigas
Microsoft Corporation
• Geographically distributed across 3 regions
• Access to data anywhere, at any time
• >200 petabytes of raw storage by December 2011
• Blobs
• Tables
• Queues
• Drives
Windows Azure Storage High Level Architecture
Design Goals
Access blob storage via the URL: http://<account>.blob.core.windows.net/
[Diagram: the Storage Location Service and data access through load balancers (LB) into a Storage Stamp made up of Front-Ends, a Partition Layer, and a Stream Layer. Labels: inter-stamp (geo) replication; intra-stamp replication in the Stream Layer.]
Stream Layer (Distributed File System)
• Append-only distributed file system
• All data from the Partition Layer is stored into files (extents) in the Stream Layer
• An extent is replicated 3 times across different fault and upgrade domains
  • With random selection for where to place replicas for fast MTTR
• Checksum all stored data
  • Verified on every client read
  • Scrubbed every few days
• Re-replicate on disk/node/rack failure or checksum mismatch
[Diagram: Stream Layer with three masters (M) coordinated by Paxos, and Extent Nodes (EN)]
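The placement rule on this slide (3 replicas, different fault and upgrade domains, random selection) can be sketched as follows. This is a minimal greedy illustration, not the Stream Layer's actual algorithm: the `ExtentNode` type, the `place_replicas` function, and the 4 x 3 domain layout are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ExtentNode:
    """Hypothetical description of an Extent Node (EN) and its domains."""
    name: str
    fault_domain: int
    upgrade_domain: int


def place_replicas(nodes: Sequence[ExtentNode], count: int = 3,
                   rng: Optional[random.Random] = None) -> List[ExtentNode]:
    """Pick `count` nodes at random so that no two chosen nodes share a
    fault domain or an upgrade domain.

    Greedy sketch: it can raise even when a valid placement exists,
    which is acceptable for illustration but not for production use.
    """
    rng = rng or random.Random()
    candidates = list(nodes)
    rng.shuffle(candidates)  # random selection spreads replicas across nodes
    chosen: List[ExtentNode] = []
    for node in candidates:
        if all(node.fault_domain != c.fault_domain and
               node.upgrade_domain != c.upgrade_domain for c in chosen):
            chosen.append(node)
            if len(chosen) == count:
                return chosen
    raise ValueError("not enough distinct fault/upgrade domains")


# Assumed layout: 12 nodes covering every pair of 4 fault domains
# and 3 upgrade domains.
nodes = [ExtentNode(f"EN{i + 1}", fault_domain=i // 3, upgrade_domain=i % 3)
         for i in range(12)]
replicas = place_replicas(nodes, rng=random.Random(0))
```

Because replicas of different extents land on different random node sets, re-replication after a failure can read from many nodes at once, which is the "fast MTTR" point on the slide.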
Partition Layer
• Provides transaction semantics and strong consistency for Blobs, Tables and Queues
• Stores and reads the objects to/from extents in the Stream Layer
• Provides inter-stamp (geo) replication by shipping logs to other stamps
• Scalable object index via partitioning
[Diagram: Partition Layer (Partition Master, Lock Service, Partition Server) on top of the Stream Layer (masters M with Paxos, Extent Nodes EN)]
Front End Layer
• Stateless servers
• Authentication + authorization
• Request routing
[Diagram: Front End Layer (FE servers) on top of the Partition Layer (Partition Master, Lock Service, Partition Server) and the Stream Layer (masters M with Paxos, Extent Nodes EN)]
[Diagram: path of an incoming write request and its ack through the Front End Layer (FE), the Partition Layer (Partition Master, Lock Service, Partition Server), and the Stream Layer (masters M with Paxos, Extent Nodes EN)]
Partition Layer
• Need a scalable index for the objects that can
  • Spread the index across 100s of servers
  • Dynamically load balance
  • Dynamically change what servers are serving each part of the index based on load
Blob Index, sorted by (Account Name, Container Name, Blob Name) from aaaa / aaaaa / ... to zzzz / zzzzz / ...

Account Name    Container Name    Blob Name
harry           pictures          sunrise
harry           pictures          sunset
richard         videos            soccer
richard         videos            tennis

Partition Map (shown at both the Front-End Server and the Partition Master)
A-H:    PS1
H'-R:   PS2
R'-Z:   PS3

[Diagram: within a Storage Stamp, Partition Servers PS1, PS2, and PS3 serve the index ranges A-H, H'-R, and R'-Z]
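The range lookup implied by the Partition Map can be sketched with a binary search over sorted split keys. This is an illustrative sketch only: the `PartitionMap` class is hypothetical, and the split keys `("i",)` and `("s",)` are assumed stand-ins for the slide's A-H, H'-R, and R'-Z boundaries.

```python
from bisect import bisect_right
from typing import List, Tuple

Key = Tuple[str, ...]


class PartitionMap:
    """Hypothetical map from key ranges to Partition Servers.

    split_keys[i] is the exclusive upper bound of range i, so
    N split keys define N + 1 ranges.
    """

    def __init__(self, split_keys: List[Key], servers: List[str]):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one more server than split keys")
        if split_keys != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        self.split_keys = split_keys
        self.servers = servers

    def lookup(self, account: str, container: str, blob: str) -> str:
        """Return the server whose range contains the composite key."""
        key = (account, container, blob)
        return self.servers[bisect_right(self.split_keys, key)]


# Assumed boundaries: accounts before "i" go to PS1, before "s" to PS2,
# and everything else to PS3.
partition_map = PartitionMap([("i",), ("s",)], ["PS1", "PS2", "PS3"])
harry_server = partition_map.lookup("harry", "pictures", "sunrise")
richard_server = partition_map.lookup("richard", "videos", "soccer")
```

Moving a range to another server then only requires changing an entry in `servers`, and splitting a hot range only requires inserting a new split key, which is what makes dynamic load balancing cheap in this design.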
[Diagram: Partition Server data flow. Labels: Writes; Commit Log Stream; Metadata Log Stream; Read/Query; Checkpoint File Table; Blob Data]
Stream Layer
Stream //foo/myfile.data
[Diagram: the stream is an ordered list of pointers (Ptr E1, Ptr E2, Ptr E3, Ptr E4) to Extents E1 through E4; each extent is a sequence of blocks]
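The structure in this diagram can be sketched as plain data types: a stream holds an ordered list of extents, and each extent holds a sequence of blocks. The `Extent` and `Stream` classes below are hypothetical, and the rule that only the last, unsealed extent accepts appends is an assumption of this sketch rather than something stated on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    """Hypothetical extent: an append-only sequence of blocks."""
    name: str
    blocks: List[bytes] = field(default_factory=list)
    sealed: bool = False

    def append_block(self, data: bytes) -> int:
        """Append one block and return its byte offset within the extent."""
        if self.sealed:
            raise ValueError(f"extent {self.name} is sealed")
        offset = sum(len(b) for b in self.blocks)
        self.blocks.append(data)
        return offset


@dataclass
class Stream:
    """Hypothetical stream: an ordered list of extents."""
    name: str
    extents: List[Extent] = field(default_factory=list)

    def append(self, data: bytes) -> int:
        """Append to the last extent (assumed to be the only writable one)."""
        if not self.extents:
            raise ValueError("stream has no extents")
        return self.extents[-1].append_block(data)

    def read_all(self) -> bytes:
        """Read the stream as the concatenation of its extents' blocks."""
        return b"".join(b for e in self.extents for b in e.blocks)


stream = Stream("//foo/myfile.data")
for extent_name in ("E1", "E2", "E3"):
    extent = Extent(extent_name)
    extent.append_block(extent_name.encode())
    extent.sealed = True
    stream.extents.append(extent)
stream.extents.append(Extent("E4"))  # last extent stays unsealed
tail_offset = stream.append(b"tail")
```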
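[Diagram: creating a stream/extent. The Partition Layer sends "Create Stream/Extent" to the Stream Master (SM, three instances coordinated by Paxos). The SM allocates the extent replica set on the Extent Nodes: EN1 Primary, EN2 Secondary A, EN3 Secondary B. The Partition Layer receives "EN1 Primary; EN2, EN3 Secondary".]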
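[Diagram: appending to an extent. The Partition Layer, which knows the replica set (EN1 Primary; EN2, EN3 Secondary), sends the append to the Extent Nodes (EN1 Primary, EN2 Secondary A, EN3 Secondary B) and receives an ack. The SMs (coordinated by Paxos) are shown but are not on the append path.]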
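The append-and-ack flow can be sketched as follows, under the assumption that an append is acknowledged only after every replica has written it at the same offset. The `Replica` class and `replicated_append` function are hypothetical, and replicas are written sequentially here purely for simplicity.

```python
from typing import List


class Replica:
    """Hypothetical in-memory stand-in for one extent replica on an EN."""

    def __init__(self, name: str):
        self.name = name
        self.data = bytearray()
        self.reachable = True

    def append(self, payload: bytes) -> int:
        """Append the payload and return the offset it was written at."""
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        offset = len(self.data)
        self.data += payload
        return offset


def replicated_append(primary: Replica, secondaries: List[Replica],
                      payload: bytes) -> int:
    """Write to the primary, then to each secondary; return the offset
    (the ack) only if every replica wrote the payload at that offset."""
    offset = primary.append(payload)
    for secondary in secondaries:
        if secondary.append(payload) != offset:
            raise RuntimeError("replicas diverged")
    return offset


en1, en2, en3 = Replica("EN1"), Replica("EN2"), Replica("EN3")
first_offset = replicated_append(en1, [en2, en3], b"abc")
second_offset = replicated_append(en1, [en2, en3], b"de")
```

An offset returned by `replicated_append` is present on all three replicas, so a reader can use it against any of them. If a replica fails mid-append, no offset is returned and the replicas may differ past the last acknowledged offset, which is the situation sealing resolves.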
Stream //foo/myfile.dat
[Diagram: pointers Ptr E1 through Ptr E5 to Extents E1 through E5; the pointer to E5 is marked with a question mark]
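[Diagram: sealing an extent, case 1. An append from the Partition Layer is in progress when the Stream Master (SM, three instances coordinated by Paxos) asks the Extent Nodes for the current length. The reported length is 120, and the SM issues "Seal Extent": sealed at 120. Nodes shown: EN1 Primary, EN2 Secondary A, EN3 Secondary B, EN4.]
[Diagram: sealing an extent, case 1, continued. The extent is sealed at 120; an Extent Node syncs with the SM. Nodes shown: EN1 Primary, EN2 Secondary A, EN3 Secondary B, EN4.]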
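[Diagram: sealing an extent, case 2. An append from the Partition Layer is in progress when the SM asks the Extent Nodes for the current length. Lengths 120 and 100 appear in the diagram, and the SM issues "Seal Extent": sealed at 100. Nodes shown: EN1 Primary, EN2 Secondary A, EN3 Secondary B, EN4.]
[Diagram: sealing an extent, case 2, continued. The extent is sealed at 100; an Extent Node syncs with the SM. Nodes shown: EN1 Primary, EN2 Secondary A, EN3 Secondary B, EN4.]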
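The two sealing cases are consistent with a simple rule: the SM asks the replicas it can reach for their current length and seals at the smallest one. The sketch below assumes that rule. The `seal_length` function is hypothetical, and which Extent Node is unreachable in each case is made up for the example, since the diagrams do not say.

```python
from typing import Dict, Optional


def seal_length(reported: Dict[str, Optional[int]]) -> int:
    """Return the length at which to seal an extent.

    `reported` maps an Extent Node name to the current length it returned,
    or None if the SM could not reach that node. Sealing at the smallest
    reachable length never cuts off acknowledged data, because an append
    is acknowledged only after all replicas have written it.
    """
    reachable = [length for length in reported.values() if length is not None]
    if not reachable:
        raise RuntimeError("no reachable replica to seal from")
    return min(reachable)


# Case 1 (assumed reachability): both reachable replicas report 120.
case_1 = seal_length({"EN1": None, "EN2": 120, "EN3": 120})

# Case 2 (assumed reachability): reachable replicas report 120 and 100.
case_2 = seal_length({"EN1": None, "EN2": 120, "EN3": 100})
```

A replica that was unreachable during sealing later syncs with the SM to reach the sealed length, as the "Sync with SM" step in the diagrams indicates.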
• For Data Streams, the Partition Layer only reads from offsets returned from successful appends
  • Committed on all replicas
  • Row and Blob Data Streams
  • Offset valid on any replica
• Network partition
  • PS can talk to EN3
  • SM cannot talk to EN3
• Safe to read from EN3
[Diagram: three SMs, the Partition Server, and the replicas EN1 Primary, EN2 Secondary A, EN3 Secondary B]
• Logs are used on partition load
  • Commit and Metadata log streams
• Check commit length first
• Only read from
  • An unsealed replica if all replicas have the same commit length
  • A sealed replica
• Network partition
  • PS can talk to EN3
  • SM cannot talk to EN3
[Diagram: the Partition Server checks the commit length on the replicas (EN1 Primary, EN2 Secondary A, EN3 Secondary B); the SMs seal the extent; the Partition Server uses EN1 and EN2 for loading]
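The read rule for log streams on partition load can be sketched as a small decision function. The `ReplicaState` type and `replicas_safe_for_log_read` function are hypothetical, and the commit lengths in the example are illustrative values, not taken from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaState:
    """Hypothetical view of one replica as seen by the Partition Server."""
    name: str
    sealed: bool
    commit_length: Optional[int]  # None if the replica did not answer


def replicas_safe_for_log_read(replicas: List[ReplicaState]) -> List[str]:
    """Apply the slide's rule: read from a sealed replica, or from an
    unsealed replica only if all replicas report the same commit length.

    An empty result means no replica is safe yet; in this sketch the
    caller would then have the extent sealed and check again.
    """
    sealed = [r.name for r in replicas if r.sealed]
    if sealed:
        return sealed
    lengths = [r.commit_length for r in replicas]
    if replicas and None not in lengths and len(set(lengths)) == 1:
        return [r.name for r in replicas]
    return []


# Network partition from the slide: the SM sealed the extent using EN1 and
# EN2 but could not reach EN3, which is still unsealed and may hold extra
# unacknowledged data.
after_seal = replicas_safe_for_log_read([
    ReplicaState("EN1", sealed=True, commit_length=100),
    ReplicaState("EN2", sealed=True, commit_length=100),
    ReplicaState("EN3", sealed=False, commit_length=120),
])
```

The result here matches the slide's "Use EN1, EN2 for loading": the unsealed EN3 is skipped even though the Partition Server can reach it.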
Design Choices and Lessons Learned
• Multi-Data Architecture
  • Use extra resources to serve mixed workload for incremental costs
    • Blob -> storage capacity
    • Table -> IOps
    • Queue -> memory
    • Drives -> storage capacity and IOps
  • Multiple data abstractions from a single stack
    • Improvements at lower layers help all data abstractions
    • Simplifies hardware management
    • Tradeoff: single stack is not optimized for specific workload pattern
• Append-only System
  • Greatly simplifies replication protocol and failure handling
    • Consistent and identical replicas up to the extent's commit length
    • Keep snapshots at no extra cost
    • Benefit for diagnosis and repair
    • Erasure Coding
  • Tradeoff: GC overhead
• Scaling Compute Separate from Storage
  • Allows each to be scaled separately
    • Important for multitenant environment
    • Moving toward full bisection bandwidth between compute and storage
  • Tradeoff: Latency/BW to/from storage
Windows Azure Storage Summary
http://blogs.msdn.com/windowsazurestorage/