Oracle RAC (cluster) application scalability
Eric Grancher and Anton Topurov
CERN IT department / openlab

Introduction – Aim

Oracle Real Application Cluster (RAC) is Oracle's cluster technology. It has been used at CERN since 1996 (it was called Oracle Parallel Server up to Oracle version 8.1) and is increasingly used in High Energy Physics, at the LCG Tier 1 sites and at CERN. RAC is a "shared-everything" clustered system in which each of the clustered nodes accesses the storage directly, via a Storage Area Network (typically Fibre Channel) or via Network Attached Storage (typically NFS over Ethernet). RAC provides a scalable solution in terms of high availability and performance. The mechanism that coordinates data changes between the different cluster nodes is called "cache fusion". Applications that are not "read-mostly" may not scale easily with the number of nodes in a RAC cluster.

Figure 1. RAC logical view (shared disk, servers, clients).

Results

Siemens/ETM's PVSS software is a key component of the supervisory layer of the control systems for the LHC accelerator and experiments. It is a SCADA system; one of its components archives the monitoring data events in a database. Current requirements are as high as 150'000 events per second.

0 Initial: 100 events/s

Initial tests have shown that "out of the box" performance was an archiving rate of about 100 events per second. The database was waiting on the clients ("SQL*Net message from client" wait event). Investigations have shown that:
- a generic C++ database API was used (no bulk loading)
- a trigger was launched after each event upload.

Methodology

The methodology used for performance tuning is based on the Oracle wait interface.
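Finding the top wait event with the wait interface can be illustrated with a query over Active Session History samples. This is a minimal sketch, assuming an Oracle 10g instance with the Diagnostics Pack licensed; `v$active_session_history` is the ASH view, and the one-hour window is an arbitrary choice for illustration:

```sql
-- Hedged sketch: aggregate recent Active Session History samples
-- to see which wait events the database spends its time on.
-- The one-hour window is an arbitrary illustrative choice.
SELECT event,
       wait_class,
       COUNT(*) AS ash_samples   -- each ASH sample represents ~1 second
FROM   v$active_session_history
WHERE  sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR
AND    session_state = 'WAITING'
GROUP  BY event, wait_class
ORDER  BY ash_samples DESC;
```

The top rows of such a query (or of the corresponding AWR "Top Timed Events" section) are the starting point of each tuning iteration.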
Before version 7, developers and database administrators worked with ratios that give a statistical view of some key database indicators (in-memory versus on-disk sorts, full table scans versus indexed table accesses, …). As of Oracle version 7, the database exposes its instrumentation results (event 10046) through the wait interface, which has completely changed the methodology for performance tuning. In version 10, Oracle introduced Active Session History (ASH) and the Automatic Workload Repository (AWR), which automatically gather session-level and instance-wide wait event aggregations. The methodology used for this project is an iterative AWR/ASH-based tuning exercise, repeated until the required performance was achieved:
1. run the workload, gather ASH/AWR information
2. find the top event that slows down the processing
3. understand why time is spent on this event
4. modify the client code, the database schema, the database code or the hardware configuration.

Figure 2. Iterative methodology for performance tuning.

This iterative methodology for RAC scalability, using the wait interface features provided in Oracle 10g, has been key to the success of this work.

3 RAC aware change: 150'000 events/s

In order to reduce the cluster contention, a new insertion scheme has been developed:
- each "client" receives a unique number
- each of the tables is partitioned
- each of the "clients" inserts into a separate partition.

In the AWR reports, the cluster wait events gc current block busy (6,818 waits, 255 s, 11.1% of total call time) and gc current grant busy (24,370 waits, 228 s, 9.9%) no longer dominate; the top contributions become CPU time and log file parallel write (1,542 waits, 109 s, 4.5%, System I/O wait class).

Thanks to these changes, the system achieved 150'000 events per second at peak.
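The per-client insertion scheme can be sketched in SQL. This is a hedged sketch: events_history appears in the poster's figures, but the column names (client_id, ev_time, element_id, value) and the use of list partitioning to map each client's unique number to its own partition are illustrative assumptions:

```sql
-- Hedged sketch of the per-client partitioning scheme.
-- Column names and the number of partitions are hypothetical.
CREATE TABLE events_history (
  client_id  NUMBER        NOT NULL,  -- unique number assigned to each client
  ev_time    TIMESTAMP     NOT NULL,
  element_id NUMBER,
  value      BINARY_DOUBLE
)
PARTITION BY LIST (client_id) (
  PARTITION p001 VALUES (1),
  PARTITION p002 VALUES (2),
  PARTITION p003 VALUES (3)
  -- ... one partition per client
);

-- Each client inserts only into its own partition, so concurrent
-- inserts from different RAC instances no longer compete for the
-- same data blocks and index leaves:
INSERT /*+ APPEND */ INTO events_history PARTITION (p001)
SELECT client_id, ev_time, element_id, value FROM temp_events;
```

Because every client writes to a distinct partition, the "gc current" block transfers between instances are largely avoided.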
Thanks to the API change (iteration 1), the system achieved 2'000 events per second.

Table 1. AWR report during a test after iteration 1 (top wait event: db file sequential read, User I/O wait class).

2 Schema change: 16'000 events/s

In order to reduce the I/O needs ("db file sequential read" is a wait event linked with I/O requests), several changes have been tested and implemented:
- index usage analysis and reduction
- table structure change (move to an Index Organised Table)
- replacement of the merge by an insert
- use of direct path insert (bypassing the buffer cache).
After these changes the top wait events became cluster related (gc buffer busy, gc current block 2-way, gc current multi block request).

Figure 3. A data block has been changed on instance 5 and is held in exclusive mode by instance 5; instance 1 attempts to modify the block.

Table 3. AWR report during a test after iteration 3.

4 Space allocation + stable rate

It has been observed that under heavy load the event loading can sometimes freeze. A number of modifications have been made to pre-allocate the space (tablespace re-utilization as a background task). Thanks to these changes, the system now achieves a consistent insertion rate of 150'000 events per second.

Graph 1. Events per second loaded into the database.

Conclusions

We have shown that nearly linear scalability can be obtained with RAC for a typical write-mostly workload. The tests have been done on a 6-node RAC cluster utilizing 32 SATA disks connected via Fibre Channel to the hosts. Achieving such transaction rates requires an adequate data model and a database client interface capable of supporting bulk operations; we have used the table partitioning feature extensively. Instrumentation in the database and in the application is necessary for achieving fine tuning results.
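To make the schema-level techniques concrete, the move to an Index Organised Table and the replacement of the merge by a direct-path insert (iteration 2) might look as follows. This is a hedged sketch: which table became an IOT is not stated on the poster, and the table and column names here are hypothetical:

```sql
-- Hedged sketch of the iteration 2 table structure change:
-- an Index Organised Table stores the rows inside the primary key
-- index itself, removing the separate table segment and the extra
-- I/O needed to visit it after an index lookup.
CREATE TABLE eventlastval_iot (
  element_id NUMBER NOT NULL,
  ev_time    TIMESTAMP,
  value      BINARY_DOUBLE,
  CONSTRAINT eventlastval_iot_pk PRIMARY KEY (element_id)
)
ORGANIZATION INDEX;

-- Replacing the per-row MERGE by a direct-path insert: the APPEND
-- hint writes formatted blocks above the high-water mark, bypassing
-- the buffer cache.
INSERT /*+ APPEND */ INTO events_history
SELECT element_id, ev_time, value FROM temp_events;
```

Both changes reduce the number of "db file sequential read" waits per archived event.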
In a cache fusion transfer (Figure 3), the requesting instance sends a request to the Global Cache Service, the request is forwarded to the instance holding the block, that instance copies the block to the requester over the cluster interconnect, and the Global Cache Service is informed. The data block is not written to disk: all communication is done over the cluster interconnect.

Initially, each of the 150 clients issued one statement per event (update eventlastval set …), and a trigger then performed the archiving insert (insert into events_history (…) values (…)).

1 API change: 2'000 events/s

The company providing the software made a set of radical changes following the initial tests and recommendations. They changed the API to use OCCI (Oracle C++ Call Interface): after bulk loading a set of events into a temporary table, the software now makes a call to a PL/SQL package, which performs a direct-path insert (insert /*+ APPEND */ into events_history PARTITION (1) select … from temp).

Figure 4. Description of the architecture changes.

The test setup is a 6-node RAC with 150 clients (lxplus), each generating 1'000 changes per second.

Table 2. AWR report during a test after iteration 2. Thanks to the schema change, the system achieved 16'000 events per second; the top wait events indicate clearly that the system is slowed down by cluster-related wait events.

References

Shiers J 2006 The Evolution of Databases in HEP, Computing in High Energy and Nuclear Physics (13-17 February 2006)
Dyke J and Shaw S 2006 Pro Oracle Database 10g RAC on Linux (Apress)
Morle J 2006 Connection management in an Oracle RAC Configuration <http://www.scaleabilities.co.uk/component/option,com_remository/Itemid,53/func,fileinfo/id,12/>
Lahiri T, Srihari V, Chan W, Macnaughton N and Chandrasekaran S 2001 Cache Fusion: Extending Shared-Disk Clusters with Shared Caches, Proceedings of the 27th International Conference on Very Large Data Bases (September 11-14, 2001)
Cache Fusion and the Global Cache Service, Oracle 9i Real Application Clusters Concepts Release 2 (9.2), Part Number A96597-01
Acknowledgments

We thank Manuel Gonzalez Berges, Piotr Golonka, Wayne Salter (CERN IT-CO), Chris Lambert, Nilo Segura Chinchilla (CERN IT-DES), Luca Canali (CERN IT-PSS), Svetozár Kapusta (ALICE experiment), Bjørn Engsig and Lothar Flatz (Oracle).

CERN - IT Department, CH-1211 Genève 23, Switzerland, http://www.cern.ch/it/