AZR 302: Windows Azure Internals
Mark Russinovich, Technical Fellow, Windows Azure
Windows Azure Datacenter Architecture
[Figure: Word, SQL Server, Exchange Online, SQL Azure, Datacenter]
[Figure: Spine and TOR (top-of-rack) switches]
[Figure: DLA Architecture (Old) vs. Quantum 10 Architecture (New). Labels: DC Routers (DCR), Access Routers, BL, Aggregation + LB (AGG, LB), Spine, TOR, Digi, APC; 20 Racks; 40 Nodes]
[Figure: Load Balancer at 65.123.44.22 in front of Instance 0 (10.2.3.4) and Instance 1 (10.2.3.5); 0.5 ms]
Deploying Services
[Figure: Image Repository, Maintenance OS, FC Host Agent, Fabric Controller, Windows Azure Parent OS, Role Images, Windows Azure Node OS, Windows Azure Hypervisor, Windows Deployment Server, PXE Server]
[Figure: REST APIs, Fabric Controller, US-North Central Datacenter]
[Figure: service www.mycloudapp.net behind a Load Balancer. Role B: Worker Role; Count: 2; Update Domains: 2; Size: Medium]
[Figure: Physical Node with a Guest Partition (Role Instance, Guest Agent) and a Host Partition (FC Host Agent), separated by a trust boundary; Fabric Controller (Primary) and Fabric Controller (Replica) … Fabric Controller (Replica); Image Repository (OS VHDs, role ZIP files)]
[Figure: Role Virtual Machine. C: Resource Disk, Dynamic VHD, Windows VHD, Role VHD]
[Figure: OS Volume, Resource Volume, Role Volume; Guest Agent, Role Host, Role Entry Point]
[Figure: Core Package, Auxiliary Files, Data; numbered steps: 1 Portal, 2 RDFE, 3 FC, 4 Server]
Inside IaaS VMs
[Figure: Virtual Disk Driver, Local RAM Cache, Local On-Disk Cache, Disk Blob]
[Figure: Role Virtual Machine. C: OS Disk; D: Resource Disk (Dynamic VHD); E:, F:, etc.: Data Disks; RAM Cache, Local Disk Cache, Blobs]
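The disk figures above list a RAM cache, a local on-disk cache, and a blob behind the virtual disk driver. A minimal sketch of a read-through lookup over those three tiers follows. The class name, the dictionary-based storage, and the policy of filling both caches on a miss are assumptions made for illustration; the slides do not describe the real caching policy.

```python
class TieredDisk:
    """Hypothetical read path: RAM cache -> local on-disk cache -> blob."""

    def __init__(self, blob):
        self.blob = dict(blob)   # stands in for the disk blob in storage
        self.ram_cache = {}      # stands in for the local RAM cache
        self.disk_cache = {}     # stands in for the local on-disk cache
        self.blob_reads = 0      # how many reads had to go to the blob

    def read(self, block):
        # Fastest tier first.
        if block in self.ram_cache:
            return self.ram_cache[block]
        if block in self.disk_cache:
            data = self.disk_cache[block]
        else:
            # Miss in both caches: fetch from the blob and keep a local copy.
            data = self.blob[block]
            self.blob_reads += 1
            self.disk_cache[block] = data
        self.ram_cache[block] = data
        return data


disk = TieredDisk({0: b"boot sector"})
disk.read(0)            # goes to the blob
disk.read(0)            # served from the RAM cache
print(disk.blob_reads)
```

Only the first read of a block reaches the blob in this model; later reads are served locally.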
Updating Services and the Host OS
[Figure: FrontEnd-1, FrontEnd-2, Middle Tier-1, Middle Tier-2, Middle Tier-3 distributed across Update Domain 1, Update Domain 2, Update Domain 3; max is 20]
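One way to picture the update domain figure is to spread each role's instances over the domains in turn. The sketch below assumes a round-robin placement per role, which the slide does not state; the function name is hypothetical, and the limit of 20 comes from the "max is 20" label.

```python
from collections import defaultdict

MAX_UPDATE_DOMAINS = 20  # "max is 20" on the slide


def assign_update_domains(roles, ud_count):
    """Round-robin each role's instances over ud_count domains (assumed policy)."""
    if not 1 <= ud_count <= MAX_UPDATE_DOMAINS:
        raise ValueError("update domain count must be between 1 and 20")
    domains = defaultdict(list)
    for role, instance_count in roles.items():
        for i in range(instance_count):
            # Instance i of every role lands in domain (i mod ud_count) + 1.
            domains[i % ud_count + 1].append(f"{role}-{i + 1}")
    return dict(domains)


layout = assign_update_domains({"FrontEnd": 2, "Middle Tier": 3}, ud_count=3)
for ud in sorted(layout):
    print(f"Update Domain {ud}: {layout[ud]}")
```

Updating one domain at a time then takes down at most one FrontEnd instance and one Middle Tier instance together in this layout.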
Maintaining Service Health
Problem | Fabric Detection | Fabric Response
Role instance crashes | FC guest agent monitors role termination | FC restarts role
Guest VM or agent crashes | FC host agent notices missing guest agent heartbeats | FC restarts VM and hosted role
Host OS or agent crashes | FC notices missing host agent heartbeat | Tries to recover node; FC reallocates roles to other nodes
Detected node hardware issue | Host agent informs FC | FC migrates roles to other nodes; marks node "out for repair"
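The problem, detection, and response rows above can be read as a lookup from failure type to the component that detects it and the steps the fabric takes. The sketch below only transcribes those rows into a dictionary; the enum, the table name, and the function are hypothetical and do not reflect any actual Fabric Controller interface.

```python
from enum import Enum


class Problem(Enum):
    ROLE_INSTANCE_CRASH = "Role instance crashes"
    GUEST_VM_OR_AGENT_CRASH = "Guest VM or agent crashes"
    HOST_OS_OR_AGENT_CRASH = "Host OS or agent crashes"
    NODE_HARDWARE_ISSUE = "Detected node hardware issue"


# problem -> (detecting component, ordered response steps), as listed on the slide
HEALTH_TABLE = {
    Problem.ROLE_INSTANCE_CRASH: (
        "FC guest agent", ["FC restarts role"]),
    Problem.GUEST_VM_OR_AGENT_CRASH: (
        "FC host agent", ["FC restarts VM and hosted role"]),
    Problem.HOST_OS_OR_AGENT_CRASH: (
        "FC", ["Tries to recover node",
               "FC reallocates roles to other nodes"]),
    Problem.NODE_HARDWARE_ISSUE: (
        "Host agent", ["FC migrates roles to other nodes",
                       'Marks node "out for repair"']),
}


def response_for(problem):
    """Return the response steps the slide lists for a given problem."""
    _detector, steps = HEALTH_TABLE[problem]
    return steps


print(response_for(Problem.HOST_OS_OR_AGENT_CRASH))
```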
[Figure: health timers. Guest Agent Heartbeat: 5 s; Guest Agent Heartbeat Timeout: 10 min; Guest Agent Connect Timeout: 25 min; Role Instance Launch: Indefinite. Other labels: Load Balancer Heartbeat, Load Balancer Timeout, Role Instance Heartbeat, Role Instance "Unresponsive" Timeout, Role Instance Start, Role Instance Ready (for updates only). Other values: 15 s, 15 min, 30 s]
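Reading the timer figure as a 5 s guest agent heartbeat with a 10 min heartbeat timeout, a timeout check could look like the sketch below. That pairing of labels and values is an assumption taken from the figure text, and the function and constant names are hypothetical.

```python
GUEST_AGENT_HEARTBEAT_S = 5                 # "Guest Agent Heartbeat 5 s"
GUEST_AGENT_HEARTBEAT_TIMEOUT_S = 10 * 60   # "Guest Agent Heartbeat Timeout 10 min"


def guest_agent_timed_out(last_heartbeat_s, now_s):
    """True once no heartbeat has arrived for longer than the timeout."""
    return now_s - last_heartbeat_s > GUEST_AGENT_HEARTBEAT_TIMEOUT_S


# Number of heartbeat intervals that fit inside one timeout window.
missed_before_timeout = GUEST_AGENT_HEARTBEAT_TIMEOUT_S // GUEST_AGENT_HEARTBEAT_S
print(missed_before_timeout)
```

With these values, a long run of consecutive heartbeats has to be missed before the guest agent is declared unresponsive, so a single dropped heartbeat does not trigger recovery.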
[Figure: FrontEnd-1, FrontEnd-2, Middle Tier-1, Middle Tier-2, Middle Tier-3]
[Figure: Allocation 1 and Allocation 2, each showing Service B Role A-1 (UD 2) and Service B Role B-2 (UD 2)]
Allocation algorithm: Prefer nodes hosting same UD as role instance's UD
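The allocation rule above, prefer nodes hosting the same UD as the role instance's UD, can be sketched as a ranking over candidate nodes. Only the first preference comes from the slide. The fallbacks (empty nodes next, then any node), the name-based tie-break, and the function name are assumptions, and capacity limits and other placement constraints are ignored.

```python
def pick_node(hosted_uds, instance_ud):
    """Pick a node for an instance in update domain instance_ud.

    hosted_uds maps node name -> set of update domains already hosted there.
    """
    def rank(node):
        uds = hosted_uds[node]
        if instance_ud in uds:
            return 0      # slide's rule: node already hosts this UD
        if not uds:
            return 1      # assumed fallback: empty node
        return 2          # assumed last resort: node hosting other UDs only

    # sorted() makes ties deterministic: lowest node name wins within a rank.
    return min(sorted(hosted_uds), key=rank)


nodes = {"node-a": {1}, "node-b": {2}, "node-c": set()}
print(pick_node(nodes, instance_ud=2))
```

Grouping instances of one UD on the same nodes means a host update that takes a node down affects a single update domain.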
The Leap Day Outage: Cause and Lessons Learned
expiredate.year = currentdate.year + 1;
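The statement above adds one to the year and leaves the month and day alone, so on February 29, 2012 it asks for February 29, 2013, a date that does not exist. The sketch below reproduces that with Python's `datetime.date`, which rejects the invalid date. The `safe_expiry` fallback to February 28 is one possible choice, shown as an assumption and not as the fix that was actually deployed.

```python
from datetime import date


def naive_expiry(today):
    # Mirrors: expiredate.year = currentdate.year + 1
    return today.replace(year=today.year + 1)


def safe_expiry(today):
    """Same calendar day next year, falling back to Feb 28 for Feb 29."""
    try:
        return today.replace(year=today.year + 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year (assumed fallback).
        return today.replace(year=today.year + 1, day=28)


leap_day = date(2012, 2, 29)
try:
    naive_expiry(leap_day)
except ValueError as err:
    print("naive computation failed:", err)
print("safe computation:", safe_expiry(leap_day))
```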
Date | Start | End | Primary | Secondary | Backup 1 | Backup 2
Friday, January 13, 2012 | 11:00 AM | 10:59 AM | densamo | gagupta | padou | anue
Saturday, January 14, 2012 | 11:00 AM | 10:59 AM | jimjohn | mkeating | chuckl | padou
Sunday, January 15, 2012 | 11:00 AM | 10:59 AM | anilingl | absingh | chuckl | padou
Monday, January 16, 2012 | 11:00 AM | 10:59 AM | sushantr | lisd | saadsyed | sushantr
Tuesday, January 17, 2012 | 11:00 AM | 10:59 AM | coreysa | ppatwa | ksingh | ritwikt
Wednesday, January 18, 2012 | 11:00 AM | 10:59 AM | wakkasr | soupal | ritwikt | padou
Thursday, January 19, 2012 | 11:00 AM | 10:59 AM | roylin | mkeating | anue | padou
Event | Date (PST) | Response and Recovery Timeline
Initiating Event | 2/28/2012 16:00 | Leap year bug begins
Detection | 2/28 17:15 | 3 x 25 min retry for first batch hit, nodes start going to HI (cascading failure)
Phase 1 | 2/28 16:00 to 2/29 05:23 | New deployments fail initially and then marked offline globally to protect clusters
Phase 2 | 2/29 02:57 to 2/29 23:00 | Service management offline for 7 clusters (staggered recovery)
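The detection time in the timeline follows from the retry arithmetic: three 25-minute retries after the 16:00 start add up to 75 minutes. The short check below works that out and also computes the length of Phase 2 from the listed times; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

bug_start = datetime(2012, 2, 28, 16, 0)            # Initiating Event, PST
retry_timeout = timedelta(minutes=25)                # "25 min retry"
first_batch_detected = bug_start + 3 * retry_timeout

phase2_start = datetime(2012, 2, 29, 2, 57)
phase2_end = datetime(2012, 2, 29, 23, 0)
phase2_length = phase2_end - phase2_start

print(first_batch_detected.strftime("%m/%d %H:%M"))
print(phase2_length)
```

The computed detection time matches the 2/28 17:15 entry in the timeline.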
[Figure: Host OS (Host Agent), Hypervisor, Application VM (Guest Agent). Create a "transport cert" (Public Key, Private Key)]
[Figure: Host OS (Host Agent), Hypervisor, Application VMs (Guest Agents). Callouts: "After 25 minutes…"; "After 3 fail attempts…"; "Existing healthy VMs continue to run (until migrated)"; a callout on all new VMs starting (Service Management)]
[Figure: cascade sequence. Leap day starts… Deploying an infra update or customer VM, or "normal" hardware failure, cause nodes to fail. Normal "service healing" migrates VMs. The cascade is viral… Cascade protection threshold hit (60 nodes). All healing and infra deployment stop!]
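The cascade protection described above acts like a circuit breaker: once enough nodes have failed, healing and infrastructure deployment stop so that migrated VMs do not spread the failure further. The sketch below models that with a simple counter. The threshold of 60 nodes is read from the slide text, and the class name, method name, and the "reached or exceeded" comparison are assumptions.

```python
class CascadeGuard:
    """Hypothetical breaker that stops healing after too many node failures."""

    def __init__(self, threshold=60):
        self.threshold = threshold      # slide: threshold hit at 60 nodes
        self.failed_nodes = set()
        self.healing_enabled = True

    def report_failure(self, node):
        """Record a failed node; return whether healing is still allowed."""
        self.failed_nodes.add(node)     # a set, so repeat reports count once
        if len(self.failed_nodes) >= self.threshold:
            self.healing_enabled = False
        return self.healing_enabled


guard = CascadeGuard()
for n in range(60):
    guard.report_failure(f"node-{n}")
print(guard.healing_enabled)
```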
- Customer 1: Complete Availability Loss
- Customer 2: Partial Capacity Loss
- Customer 3: No Availability Loss
[Figure: Fixed 119 Package: 119 HA v2, 119 Networking Plugin, 119 GA v2, 119 OS. Node: VMs with 119 GA v1; 119 HA v1, 119 Networking Plugin, 119 OS; Network]
[Figure: Fixed 118 Package: 118 HA v2, 119 Networking Plugin, 118 OS. Nodes: VMs with 118 GA v1; 118 HA v1, 118 Networking Plugin, 118 OS; 119 HA v1, 119 Networking Plugin, 119 OS; Network]
[Figure: Fixed 119 Package on 118 Cluster: 119 HA v2, 119 Networking Plugin, 119 OS. Nodes: VM-1 through VM-5 with 118 GA v1; 118 HA v2, 119 Networking Plugin, 118 OS; Network]
[Figure: Fixed 119 GA v2. Automatic update cannot proceed: not enough healthy instances… Nodes: VM-1 through VM-5 with 118 GA v1; 119 HA v2, 119 Networking Plugin, 119 OS; Network]
[Figure: Fixed 119 GA v2. Manual update succeeds, but takes a long time… Nodes: VM-1 through VM-5 with 118 GA v1; 119 HA v2, 119 Networking Plugin, 119 OS; Network]