Implementing ASM Without HW RAID, A User's Experience


Luca Canali, CERN
Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008

Outline
• Introduction to ASM
  – Disk groups, failgroups, normal redundancy
• Scalability and performance of the solution
• Possible pitfalls, sharing experiences
• Implementation details, monitoring, and tools to ease ASM deployment

Architecture and main concepts
• Why ASM?
  – Provides the functionality of a volume manager and a cluster file system
  – Raw access to storage for performance
• Why ASM-provided mirroring?
  – Allows the use of lower-cost storage arrays
  – Allows mirroring across storage arrays
    • Arrays are not single points of failure
    • Array (HW) maintenance can be done in a rolling way
  – Stretch clusters

ASM and cluster DB architecture
• Oracle architecture of redundant low-cost components
  (diagram: servers, SAN, storage)

Files, extents, and failure groups
(diagrams: files and extent pointers; failgroups and ASM mirroring)

ASM disk groups
• Example: HW = 4 disk arrays with 8 disks each
• An ASM diskgroup is created using all available disks (a SQL sketch follows below)
  – The end result is similar to a file system on RAID 1+0
  – ASM allows mirroring across storage arrays
  – Oracle RDBMS processes access the storage directly (raw disk access)
  (diagram: ASM diskgroup with striping inside failgroup 1 and failgroup 2, mirroring across them)
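
For illustration, a normal-redundancy diskgroup mirrored across two arrays can be created as sketched below; the diskgroup name, failgroup names and disk paths are hypothetical, not the actual CERN configuration.

  -- Minimal sketch: normal redundancy, one failgroup per storage array,
  -- so ASM mirrors every extent across the two arrays
  CREATE DISKGROUP data1 NORMAL REDUNDANCY
    FAILGROUP array1 DISK
      '/dev/mpath/array1_1', '/dev/mpath/array1_2',
      '/dev/mpath/array1_3', '/dev/mpath/array1_4'
    FAILGROUP array2 DISK
      '/dev/mpath/array2_1', '/dev/mpath/array2_2',
      '/dev/mpath/array2_3', '/dev/mpath/array2_4';

ASM then stripes files across the disks of each failgroup and keeps the mirror copies in the other failgroup, which is what makes the result comparable to RAID 1+0.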

Performance and scalability
• ASM with normal redundancy
  – Stress tested for CERN's use
  – Scales and performs

Case study: the largest cluster I have ever installed, RAC5
• The test used 14 servers

Multipathed fiber channel
• 8 FC switches: 4 Gbps (10 Gbps uplink)

Many spindles
• 26 storage arrays (16 SATA disks each)

Case study: I/O metrics for the RAC5 cluster
• Measured, sequential I/O
  – Read: 6 GB/s
  – Read + write: 3 + 3 GB/s
• Measured, small random I/O
  – Read: 40K IOPS (8 KB read operations)
• Note:
  – 410 SATA disks, 26 HBAs on the storage arrays
  – Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM

How the test was run
• A custom SQL-based DB workload (sketched below):
  – IOPS: probe a large table (several TB) randomly via several parallel query slaves, each reading a single block at a time
  – MBPS: read a large (several TB) table with parallel query
• The test table used for the RAC5 cluster was 5 TB in size
  – Created inside a diskgroup of 70 TB
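
The slides do not show the actual test scripts; a minimal sketch of the two access patterns, assuming a hypothetical multi-TB table TEST_BIG with an indexed ID column, could look like this (the real workload ran many such sessions and parallel query slaves concurrently).

  -- IOPS pattern: random single-block reads, driven from many concurrent sessions
  SELECT /*+ INDEX(t) */ payload
  FROM   test_big t
  WHERE  id = :random_id;

  -- MBPS pattern: multi-block sequential reads via parallel query
  SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*)
  FROM   test_big t;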

Possible pitfalls
• Production stories
  – Sharing experiences
  – 3 years in production, 550 TB of raw capacity

Rebalancing speed
• Rebalancing is performed (and mandatory) after space management operations
  – Typically after HW failures (to restore the mirror)
  – Goal: balanced space allocation across disks
  – Not based on performance or utilization
  – ASM instances are in charge of rebalancing (rebalance power, see the sketch below)
• Scalability of rebalancing operations?
  – In 10g, serialization wait events can limit scalability
  – Even at maximum speed, rebalancing is not always I/O bound
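
For reference, the rebalancing speed is controlled via the rebalance power; a sketch with an illustrative diskgroup name (in 10g the power ranges from 0 to 11).

  -- Default rebalance power for operations on this ASM instance
  ALTER SYSTEM SET asm_power_limit = 5;

  -- Restart or raise the rebalance of one diskgroup at maximum speed
  ALTER DISKGROUP data1 REBALANCE POWER 11;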

Rebalancing, an example
(figure)

VLDB and rebalancing
• Rebalancing operations can move more data than expected
• Example:
  – 5 TB allocated: ~100 disks, 200 GB each
  – A disk is replaced (diskgroup rebalance)
  – The total I/O workload is 1.6 TB (8x the disk size!)
• How to see this: query v$asm_operation; the EST_WORK column keeps growing during the rebalance (see the sketch below)
• The issue: excessive repartnering
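
The growing estimate can be watched from the ASM instance; a minimal sketch of the v$asm_operation query mentioned above.

  -- During a rebalance, EST_WORK (extents to move) keeps increasing
  -- when excessive repartnering inflates the amount of data to relocate
  SELECT group_number, operation, state, power,
         sofar, est_work, est_rate, est_minutes
  FROM   v$asm_operation;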

Rebalancing issues wrap-up
• Rebalancing can be slow
  – Many hours for very large diskgroups
• Associated risk
  – A 2nd disk failure while rebalancing
  – Worst case: loss of the diskgroup because partner disks fail

Fast Mirror Resync
• ASM 10g with normal redundancy does not allow part of the storage to be taken offline
  – A transient error in a storage array can cause several hours of rebalancing to drop and re-add the disks
  – It is a limiting factor for scheduled maintenance
• 11g has the new feature 'fast mirror resync' (see the sketch below)
  – A great feature for rolling interventions on HW
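
A sketch of how the 11g feature is used for a rolling intervention; the diskgroup and failgroup names and the repair time are illustrative, and the diskgroup compatibility attributes must be set to 11.1 or higher.

  -- Allow offlined disks to be kept (not dropped) for up to 3.6 hours
  ALTER DISKGROUP data1 SET ATTRIBUTE 'disk_repair_time' = '3.6h';

  -- Offline one array before the maintenance ...
  ALTER DISKGROUP data1 OFFLINE DISKS IN FAILGROUP array1;

  -- ... and afterwards resync only the extents changed in the meantime
  ALTER DISKGROUP data1 ONLINE DISKS IN FAILGROUP array1;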

ASM and filesystem utilities
• Only a few tools can access ASM
  – asmcmd, dbms_file_transfer, XDB, FTP
  – Limited operations (no copy, rename, etc.)
  – They require open DB instances
  – File operations are difficult in 10g (see the sketch below)
• 11g asmcmd has the copy command
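
As an example of the 10g workarounds, a file can be copied between ASM and a regular filesystem with dbms_file_transfer through an open database instance; the directory paths and file names below are hypothetical.

  -- Directory objects pointing into ASM and to a filesystem staging area
  CREATE OR REPLACE DIRECTORY asm_dir AS '+DATA1/mydb/datafile';
  CREATE OR REPLACE DIRECTORY fs_dir  AS '/backup/staging';

  BEGIN
    DBMS_FILE_TRANSFER.COPY_FILE(
      source_directory_object      => 'ASM_DIR',
      source_file_name             => 'users.276.657641389',
      destination_directory_object => 'FS_DIR',
      destination_file_name        => 'users_copy.dbf');
  END;
  /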

ASM and corruption
• ASM metadata corruption (a consistency check sketch follows below)
  – Can be caused by bugs
  – One case in production after a disk eviction
• Physical data corruption
  – ASM automatically fixes most corruption on the primary extent
    • Typically when doing a full backup
  – Secondary extent corruption goes undetected until a disk failure or rebalance exposes it
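
ASM metadata consistency can be verified from the ASM instance; a sketch with an illustrative diskgroup name (10g syntax; findings are written to the ASM alert log).

  -- Verify the internal consistency of the diskgroup metadata without repairing
  ALTER DISKGROUP data1 CHECK ALL NOREPAIR;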

Disaster recovery
• Corruption issues were fixed using a physical standby to move to 'fresh' storage
• For HA, our experience is that disaster recovery is needed
  – Standby DB
  – On-disk (flash) copy of the DB

Implementation details

Storage deployment
• Current storage deployment for Physics Databases at CERN
  – SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16 per enclosure)
  – Linux x86_64, no ASMLib; device mapper instead (naming persistence + HA)
  – Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed
  – Biggest DB over 7 TB (more to come when the LHC starts; estimated growth up to 11 TB/year)

Storage deployment
• ASM implementation details
  – Storage in JBOD configuration (1 disk -> 1 LUN)
  – Each disk partitioned at the OS level
    • 1st partition: 45% of the disk size, the faster part of the disk (short stroke)
    • 2nd partition: the rest, the slower part of the disk (full stroke)
  (diagram: outer sectors = short stroke, inner sectors = full stroke)

Storage deployment
• Two diskgroups created for each cluster
  – DATA: data files and online redo logs, on the outer part of the disks
  – RECO: flash recovery area destination (archived redo logs and on-disk backups), on the inner part of the disks
• One failgroup per storage array (a balance check query follows below)
  (diagram: DATA_DG1 and RECO_DG1 spanning failgroups 1 to 4)
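
Whether space stays balanced across the failgroups (one per storage array) can be checked from the ASM instance; a minimal sketch.

  -- Disks, total and used space per diskgroup and failgroup
  SELECT g.name AS diskgroup, d.failgroup,
         COUNT(*)                                 AS disks,
         ROUND(SUM(d.total_mb)/1024)              AS total_gb,
         ROUND(SUM(d.total_mb - d.free_mb)/1024)  AS used_gb
  FROM   v$asm_disk d
  JOIN   v$asm_diskgroup g ON g.group_number = d.group_number
  GROUP  BY g.name, d.failgroup
  ORDER  BY g.name, d.failgroup;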

Storage management
• SAN setup in JBOD configuration involves many steps and can be time consuming
  – Storage level
    • Logical disks
    • LUNs
    • Mappings
  – FC infrastructure: zoning
  – OS: creating the device mapper configuration
    • multipath.conf (name persistency + HA)

Storage management
• Storage manageability
  – DBAs set up the initial configuration
  – ASM means extra maintenance in case of storage maintenance (disk failure)
  – Problems
    • How to quickly set up the SAN configuration
    • How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk
      Example: SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM disk TEST1_DATADG1_0016

Storage management
• Solution
  – Configuration DB: a repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays
    • Big initial effort
    • Easy to maintain
    • High ROI
  – Custom tools
    • Tools to identify
      – SCSI (block) device <-> device mapper device <-> physical storage and FC port
      – Device mapper device <-> ASM disk
    • Automatic generation of the device mapper configuration

Storage management
Custom-made script lssdisks.py: maps SCSI id (host, channel, id) -> storage name and FC port, and SCSI id -> block device -> device mapper name and status -> storage name and FC port.

[ ~]$ lssdisks.py
The following storages are connected:
* Host interface 1:
  Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0
  Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0
  Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0
  Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0
  Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0
* Host interface 2:
  Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1
  Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1
  Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1
  Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1
  Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1

SCSI Id    Block DEV  MPath name    MP status  Storage   Port
---------  ---------  ------------  ---------  --------  ----
[0:0:0:0]  /dev/sda   -             -          -         -
[1:0:0:0]  /dev/sdb   rstor316_CRS  OK         rstor316  0
[1:0:0:1]  /dev/sdc   rstor316_1    OK         rstor316  0
[1:0:0:2]  /dev/sdd   rstor316_2    FAILED     rstor316  0
[1:0:0:3]  /dev/sde   rstor316_3    OK         rstor316  0
[1:0:0:4]  /dev/sdf   rstor316_4    OK         rstor316  0
[1:0:0:5]  /dev/sdg   rstor316_5    OK         rstor316  0
[1:0:0:6]  /dev/sdh   rstor316_6    OK         rstor316  0
...

Storage management
Custom-made script listdisks.py: maps device mapper name -> ASM disk and status.

[ ~]$ listdisks.py
DISK           NAME               GROUP_NAME    FG        H_STATUS  MODE    MOUNT_S  STATE   TOTAL_GB  USED_GB
-------------  -----------------  ------------  --------  --------  ------  -------  ------  --------  -------
rstor401_1p1   RAC9_DATADG1_0006  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_1p2   RAC9_RECODG1_0000  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     119.9      1.7
rstor401_2p1   --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     111.8
rstor401_2p2   --                 --            --        UNKNOWN   ONLINE  CLOSED   NORMAL     120.9
rstor401_3p1   RAC9_DATADG1_0007  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_3p2   RAC9_RECODG1_0005  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_4p1   RAC9_DATADG1_0002  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_4p2   RAC9_RECODG1_0002  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_5p1   RAC9_DATADG1_0001  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_5p2   RAC9_RECODG1_0006  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_6p1   RAC9_DATADG1_0005  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.5
rstor401_6p2   RAC9_RECODG1_0007  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_7p1   RAC9_DATADG1_0000  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_7p2   RAC9_RECODG1_0001  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_8p1   RAC9_DATADG1_0004  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     111.8     68.6
rstor401_8p2   RAC9_RECODG1_0004  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL     120.9      1.8
rstor401_CRS1  ...
rstor401_CRS2  ...
rstor401_CRS3  ...
rstor402_1p1   RAC9_DATADG1_0015  RAC9_DATADG1  RSTOR402  MEMBER    ONLINE  CACHED   NORMAL     111.8     59.9
...
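
The ASM side of such a report presumably comes from v$asm_disk on the ASM instance; a minimal sketch of a query producing similar columns (the device mapper name is taken from the disk PATH).

  -- Device path vs ASM disk name, diskgroup, failgroup and status
  SELECT d.path, d.name, g.name AS group_name, d.failgroup,
         d.header_status, d.mode_status, d.mount_status, d.state,
         ROUND(d.total_mb/1024, 1)               AS total_gb,
         ROUND((d.total_mb - d.free_mb)/1024, 1) AS used_gb
  FROM   v$asm_disk d
  LEFT JOIN v$asm_diskgroup g ON g.group_number = d.group_number
  ORDER  BY d.path;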

Storage management
Custom-made script gen_multipath.py: automatic generation of the device mapper configuration; the device mapper alias provides naming persistency and multipathing (HA).

[ ~]$ gen_multipath.py
# multipath default configuration for PDB
defaults {
    udev_dir          /dev
    polling_interval  10
    selector          "round-robin 0"
    ...
}
...
multipaths {
    multipath {
        wwid   3600d0230006c26660be0b5080a407e00
        alias  rstor916_CRS
    }
    multipath {
        wwid   3600d0230006c26660be0b5080a407e01
        alias  rstor916_1
    }
    ...
}

SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1

Storage monitoring
• ASM-based mirroring means that Oracle DBAs need to be alerted of disk failures and evictions
  – Dashboard with a global overview: a custom solution, RACMon
• ASM level monitoring
  – Oracle Enterprise Manager Grid Control
  – RACMon: alerts on missing disks and failgroups, plus a dashboard (a sketch of such a check follows below)
• Storage level monitoring
  – RACMon: LUNs' health and storage configuration details, with a dashboard
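
A sketch of the kind of check such monitoring can run against each ASM instance (the actual RACMon implementation is not shown in the slides; thresholds and alerting logic are omitted).

  -- Disks needing DBA attention: missing, offlined or in an abnormal state
  SELECT group_number, disk_number, name, failgroup, path,
         header_status, mode_status, mount_status, state
  FROM   v$asm_disk
  WHERE  mount_status = 'MISSING'
     OR  mode_status <> 'ONLINE'
     OR  state NOT IN ('NORMAL', 'UNKNOWN');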

Storage monitoring
• ASM instance level monitoring
  (screenshot: new failing disk on RSTOR614)
• Storage level monitoring
  (screenshot: new disk installed on RSTOR903, slot 2)

Conclusions
• Oracle ASM diskgroups with normal redundancy
  – Used at CERN instead of HW RAID
  – Performance and scalability are very good
  – Allows the use of low-cost HW
  – Requires more admin effort from the DBAs than high-end storage
  – 11g has important improvements
• Custom tools ease administration

Q&A
Thank you
• Links:
  – http://cern.ch/phydb
  – http://www.cern.ch/canali