HADR troubleshooting and best practices a Pepsi Co
HADR troubleshooting and best practices a Pepsi. Co guide Speaker Kimberly Floss, Pepsi. Co Inc. Chris Godfrey, IBM Inc. Session Code: C 11 Wed, May 06, 2015 (10: 30 AM - 11: 30 AM) : Jefferson/Ross | Platform: DB 2 9. 7, 10. 1, 10. 5
Introduction 1. Slowdowns § Primary is slow § Stand by is slow § Network 2. Hangs (investigate as a normal hang scenario) § Primary is hung § Is it really hung or just very vary slow? ? § Standby is hung 2
Recap - HADR EDU’s • EDU’s of interest • On both primary and standby • db 2 lfr • db 2 loggw Log file reader. Logger. Log writer. Despite the name, on standby, the HADR EDU writes logs to disk, for performance reasons. • On standby only. Same EDU’s used for rollforward • db 2 agnti • db 2 shred • db 2 redom • db 2 redow An independent agent (not associated to any client session) that drives rollforward. Shredder. Parse log pages into log records. Redo master. Master EDU for log replay. Worker EDU for log replay. There may be multiple worker EDU’s to perform parallel replay. 3
HADR Data Flow Standby Primary peer and remote catchup HADR EDU log records SQL peer state log page loggw HADR connection remote catchup HADR EDU peer and remote catchup LFR write log pages data write table space write log pages log shredder log pages local catchup LFR read log pages old log records log pages log old log redo master log records redo workers data write table space 4
Tools • OS • • • vmstat iostat netstat ps tprof • DB 2 • • • db 2 pd snapshots db 2 trc db 2 pdlog simhadr • Custom scripts E. g HADRTrigger. Primary. pl HADRTrigger. Standby. pl collect. Primary. ksh collect. Standby. ksh 5
db 2 set HADR Variables DB 2_HADR_BUF_SIZE Pool of the log data in memory on the standby, by increasing the size of the standby's log receive buffer. The needed size depends on the anticipated log generation rate and the duration of the congestion. The default receive buffer size is two times primary log buffer size DB 2_HADR_SOSNDBUF / DB 2_HADR_SORCVBUF This changes the TCP send buffer for the OS. When send buffer is larger than a flush, the flush can usually be sent in one send() call. The size of the flush can be determined by calculating the number of 4 k pages between SQLPG_PHF_END_OF_FLUSH records in the log files (need to check bit value 0 x 00000010). A link to a perl script db 2 flushsize can be found on the simhadr link, this script will calculate the flushsize without having to format the log files. 6
db 2 set HADR Variables DB 2 BPVARS PREC_NUM_AGENTS Controls number of redow workers, on large systems a large number of redo workers can actually slow down recovery by the db 2 remow latching resources while looking for work. The value should be the power of 2 plus 1(for the db 2 redom), eg, 3, 5, 9, 17, 33 QLB_CFG_PREC_NUM_WOCB (tablespace waiter control block). Whenever there's a tablespace blocking log record, redo master calls sqlpr. Get. Wait. Others() to get a preallocated WOCB control block to sync log record processing. When the control blocks are used up, redo master has to wait until redo worker frees some. Default number of this control block is 128. #define SQLPR_NUM_WAITOTHERS 128 Uint num. WOCB = SQLPR_NUM_WAITOTHERS; QLB_CFG_PREC_NUM_WOCB can be verified in the db 2 diag. log, per. 2012 -09 -10 -23. 51. 11. 011639 -600 I 9636 A 504 LEVEL: Warning PID : 16056364 TID : 6170 PROC : db 2 sysc 0 INSTANCE: hans NODE : 000 DB : TST APPHDL : 0 -8 APPID: *LOCAL. DB 2. 120911095111 EDUID : 6170 EDUNAME: db 2 agent (TST) 0 FUNCTION: DB 2 UDB, recovery manager, sqlprecm, probe: 1181 MESSAGE : SQLB_CFG_PREC_NUM_WOCB DATA #1 : Hexdump, 8 bytes 0 x 070000000 A 3 FD 140 : 0000 0400 Number of queues 0 x 400 = 1024 7
db 2 pd -hadr • The key db 2 pd option for information on the status of HADR is –hadr. It needs to be collected on both the primary and standby with multiple iterations. V 9. 5/7: HADR Information: Role State Standby Peer Sync. Mode Heart. Beats. Missed Nearsync 0 Log. Gap. Run. Avg (bytes) 21072124 Connect. Status Connect. Timeout Connected Sun Dec 12 01: 45: 42 2010 (1292139942) 300 Peer. Window. End Peer. Window Tue Jan 18 17: 00 2011 (1295392020) 300 Local. Host chtaixdb 4 Remote. Host ibmcfdb 4 Local. Service 55555 Remote. Instance vero Primary. File Primary. Pg S 0007601. LOG 13620 Primary. LSN 0 x 000000 EDD 6 EDF 3 EC Stand. By. File Stand. By. Pg S 0007601. LOG 5997 Stand. By. LSN Stand. By. Rcv. Buf. Used 0 x 000000 EDD 513 D 470 80% 8
db 2 pd - hadr • V 10. 1 HADR_ROLE REPLAY_TYPE HADR_SYNCMODE STANDBY_ID LOG_STREAM_ID HADR_STATE PRIMARY_MEMBER_HOST PRIMARY_INSTANCE PRIMARY_MEMBER STANDBY_MEMBER_HOST STANDBY_INSTANCE STANDBY_MEMBER HADR_CONNECT_STATUS_TIME HEARTBEAT_INTERVAL(seconds) HADR_TIMEOUT(seconds) TIME_SINCE_LAST_RECV(seconds) PEER_WAIT_LIMIT(seconds) LOG_HADR_WAIT_CUR(seconds) LOG_HADR_WAIT_RECENT_AVG(seconds) LOG_HADR_WAIT_ACCUMULATED(seconds) LOG_HADR_WAIT_COUNT SOCK_SEND_BUF_REQUESTED, ACTUAL(bytes) SOCK_RECV_BUF_REQUESTED, ACTUAL(bytes) PRIMARY_LOG_FILE, PAGE, POS STANDBY_LOG_FILE, PAGE, POS HADR_LOG_GAP(bytes) STANDBY_REPLAY_LOG_FILE, PAGE, POS STANDBY_RECV_REPLAY_GAP(bytes) PRIMARY_LOG_TIME STANDBY_REPLAY_LOG_TIME STANDBY_RECV_BUF_SIZE(pages) STANDBY_RECV_BUF_PERCENT STANDBY_SPOOL_LIMIT(pages) PEER_WINDOW(seconds) READS_ON_STANDBY_ENABLED = = = = = = = = = = STANDBY PHYSICAL SYNC 0 0 PEER chtwindb 8 asteron 0 ibmcfdb 8 asteron 0 CONNECTED 08/08/2012 10: 53: 13. 745026 (1347155593) 30 120 11 0 0. 000000 0 256000, 256000 S 0000009. LOG, 3695, 994932233 0 09/09/2012 10: 53: 31. 000000 (1347155611) 128000 0 256000 0 N 9
db 2 pd • -appl –tran –lock Collect information on application state, transaction LSN’s and any locking • -bufferpool Bufferpool performance on the Standy can impact on replay performace when pages need to be read into the bufferpool • -latches Are there any latches with multiple waiters? • -edus What is the CPU utilization? • -stacks Stacks on the standby for the db 2 redom and db 2 redow can identify any replay issues 0 x 0900000 EEEF 8 thread_wait + 0 x 98 0 x 0900000007 E 7 DB 50 sqlo. Wait. EDUWait. Post + 0 x 1 C 4 0 x 09000000050 B 99 C 4 sqlpr. Wait. During. PRec__FP 8 sqe. Agent. P 16 SQLO_EDUWAITPOST + 0 x. D 0 0 x 090000000425 B 94 C sqlpr. Get. Wait. Others__FP 8 sqe. Agent. P 10 SQLPR_PRCBPUl + 0 x 1 D 4 0 x 090000000425 ED 3 C sqlpr. Block. Tsp__FP 8 sqe. Agent. P 9 SQLP_DBCBP 20 SQLPR_PR_QUEUE_ENTRYUl. PUl + 0 x 2 F 8 0 x 09000000050 BE 674 sqlp. PRec. Read. Log__FP 8 sqe. Agent. P 8 SQLP_ACBP 9 SQLP_DBCB + 0 x 558 0 x 090000000444 A 810 sqlp. Parallel. Recovery__FP 8 sqe. Agent. P 5 sqlca + 0 x. C 4 0 x 0900000004 ECB 01 C sqle. Sub. Coord. Process. Request__FP 8 sqe. Agent + 0 x 208 0 x 0900000004710 CB 8 Run. EDU__8 sqe. Agent. Fv + 0 x 3 D 4 0 x 0900000007 D 9 A 320 EDUDriver__9 sqz. EDUObj. Fv + 0 x. DC 10
HADR information from other than db 2 pd • Database Snapshot HADR Status Role = Standby State = Peer Synchronization mode = Nearsync Connection status = Connected, 02/07/2012 22: 31: 48. 843517 Peer window end = 02/08/2012 02: 52: 03. 000000 (1328691123) Peer window (seconds) = 300 Heartbeats missed = 0 Local host = pepxd 200157 b Local service = 55555 Remote host = pepxd 200156 b Remote service = 55555 Remote instance = db 2 q 1 h timeout(seconds) = 300 Primary log position(file, page, LSN) = S 0000050. LOG, 5965, 0000039 B 97 D 09677 Standby log position(file, page, LSN) = S 0000050. LOG, 5965, 0000039 B 97 D 09627 Log gap running average(bytes) = 802 • MON_GET_HADR select HADR_ROLE, SUBSTR(HADR_STATE, 1, 6) as HADR_STATE, STANDBY_RECV_BUF_SIZ TANDBY_RECV_BUF_PERCENT from table(mon_get_hadr(null)) HADR_ROLE HADR_STATE STANDBY_RECV_BUF_SIZE STANDBY_RECV_BUF_PERCENT -------------------PRIMARY PEER 128000 +0. 0000000 E+000 1 record(s) selected. 11
db 2 trc • • • 22 23 24 25 When tracing on the standby, as delays may be due to replay of log records the trace should be: db 2 trc on -t -m "*. *. HDR, SQLPR. *. *” db 2 trc on –t –Madd hdr –Madd sqlpr When tracing on the primary it is a suggestion to get both a hdr trace and a separate full trace db 2 trc on -t -m "*. *. HDR. *. *” and db 2 trc on –t. . . db 2 trc on –t –Madd hdr –Madd sqlpr and db 2 trc on –t. . . Example trace of a network congestion 0: 270509016 0: 270509703 0: 270510359 0: 270516488 hdr. Send. Log entry | hdr. Send. Msg entry | | hdr. Send. Buffer exit [rc = 0 x 810 F 001 F = -2129723361 = SQLO_SOCKET_NOT_READY] # rc indicates congestion. Send log hits congestion at time 0: 270516488 # Send request moved to deferred send queue. 26 27 28 29 30 31 32 0: 270518292 0: 270519335 0: 270522788 0: 374878414 0: 374881430 0: 484225528 0: 484228229 | hdr. Send. Msg exit hdr. Send. Log exit hdr. Edu. Accept. Event 33 0: 489034090 | hdr. Send. Msg entry 34 35 36 37 0: 489034796 0: 489190235 0: 489192565 0: 489193954 | | hdr. Send. Buffer entry | | hdr. Send. Buffer exit | hdr. Send. Msg exit hdr. Edu. Accept. Event exit entry # Congestion clears at 0: 489034090 hdr. Send. Msg() is called to resend the deferred request. 12
db 2 pdlog • During a slowdown / hang collecting the log files during this period is important to collect and format these to see what log operations were being replayed on the standby. • To determine the log operations and number per each log: db 2 pdlog –a –s 11051. fmt grep “Function ID =" 11051. fmt | cut -d: -f 2| sort -n | uniq –c |sort –k 1, 1 nr 6622 3652 6095 289 1 MULTI_PAGE_CREATE_DP Function ID = 67 = 103 : CREATEPG_SMS_DP Function ID = 85 = 133 : SPLIT_NODE_BIDI_DP Function ID = 6 B = 107 : GETPAGE_DP Function ID = 73 = 115 : UNLTCH_DP Function ID = 69 = 105 : SQLI_LRT_LATCH Function ID = 02 = 2 : GET_ONE_EXT Function ID = 0 A = 10 : OBJ_ADD_EXT Function ID = 85 = 133 : 13
simhadr • • Info can be found on developer works Used to Estimate HADR performance under different sync modes. Measure disk performance. Measure network performance using HADR specific workload. Experiment with hypothetical disk speed. Experiment with various network tuning options. • DB 2_HADR_SOSNDBUF • DB 2_HADR_SORCVBUF • DB 2_HADR_BUF_SIZE Sample output Sending log flushes. Press Ctrl-C to stop. NEARSYNC: Total 410255360 bytes in 4. 000821 seconds, 102. 542793 MBytes/sec Total 3130 flushes, 0. 001278 sec/flush, 32 pages (131072 bytes)/flush Total 410255360 bytes sent in 4. 000821 seconds. 102. 542793 MBytes/sec Total 10893 send calls, 37. 662 KBytes/send, Total 3136 congestions, 0. 848714 seconds, 0. 000270 second/congestion Total 150240 bytes recv in 4. 000821 seconds. 0. 037552 MBytes/sec Total 3130 recv calls, 0. 048 KBytes/recv Distribution of log write size (unit is byte): Total 3130 numbers, Sum 410255360, Min 131072, Max 131072, Avg 131072 Exactly 131072 3130 numbers Distribution of log shipping time (unit is microsecond): Total 3130 numbers, Sum 3989623, Min 1262, Max 1622, Avg 1274 From 1024 to 2047 3130 numbers 14
Example 1 • SAP System. Standby not able to keep up with Primary causing the Primary to become slow. Prinary: Database Partition 0 -- Database PR 2 -- Active -- Up 0 days 03: 29 -- Date 08/23/2011 18: 20: 02 HADR Information: Role State Primary Peer Sync. Mode Heart. Beats. Missed Nearsync 0 Log. Gap. Run. Avg (bytes) 5516608460 Connect. Status Connect. Timeout Congested Tue Aug 23 18: 20: 02 2011 (1314094802) 120 Local. Host BPPDBA Local. Service PR 2_HADR_1 Remote. Host BPPHADR Remote. Service PR 2_HADR_2 Primary. File Primary. Pg S 0001883. LOG 76581 Primary. LSN 0 x 00000 B 1160 F 7 DFAC Stand. By. File Stand. By. Pg S 0001873. LOG 9604 Stand. By. LSN 0 x 00000 B 10181 DC 7 EF Remote. Instance db 2 pr 2 Standby: Database Partition 0 -- Database PR 2 -- Active -- Up 0 days 00: 28: 02 -- Date 08/23/2011 18: 28 HADR Information: Role State Sync. Mode Standby Peer Nearsync Heart. Beats. Missed 0 Log. Gap. Run. Avg (bytes) 5521392333 Connect. Status Connect. Timeout Connected Tue Aug 23 17: 51: 05 2011 (1314093065) 120 Local. Host BPPHADR Local. Service PR 2_HADR_2 Remote. Host BPPDBA Remote. Service. Remote PR 2_HADR_1 Instance db 2 pr 2 Primary. File Primary. Pg S 0001878. LOG 22845 Primary. LSN 0 x 00000 B 10 B 7995 FFB Stand. By. File Stand. By. Pg S 0001867. LOG 82792 Stand. By. LSN 0 x 00000 B 0 F 6 E 7 C 0 F 9 E Stand. By. Rcv. Buf. Used 99% 15
Example 1 From the db 2 pd –hadr output, the HADR receive buffer is full on the Standby. This will cause the Primary to become congested and slowdown as in Near. Sync mode, the Primary waits for acknowledgement that the log has been written to the HADR buffer on the Standby What data to collect to investigate further? ? • db 2 pd –stacks -dpsdbcb • db 2 trc -t -m "*. *. SQLP, SQLPR, SQLO. *. *" • Logs at the time of the slowdown Stacks of the db 2 redow showed stacks waiting on resources 0 x 090000000 C 53 D 72 C sqlo. Xlatch. Conflict + 0 x 2 BC 0 x 090000000 CC 91430 sqlpr. Find. Queue__FP 9 SQLP_DBCBUl. T 2 PUl + 0 x 944 0 x 090000000 CC 90754 sqlp. PRec. Proc. Log__FP 8 sqe. Agent. P 8 SQLP_ACBP 9 SQLP_DBCB + 0 x 970 0 x 090000000 BC 5 B 8 BC sqlp. Parallel. Recovery__FP 8 sqe. Agent. P 5 sqlca + 0 x. AA 8 0 x 090000000 C 7844 AC sqle. Sub. Coord. Process. Request__FP 8 sqe. Agent + 0 x 21 C 0 x 090000000 BD 89 E 0 C Run. EDU__8 sqe. Agent. Fv + 0 x 380 0 x 090000000 C 63 E 748 EDUDriver__9 sqz. EDUObj. Fv + 0 x 78 0 x 090000000 C 5 CB 550 sqlo. EDUEntry + 0 x 57 C Changing PREC_NUM_AGENTS from 32 -> 16 did not help: When looking at PREC_NUM_AGENTS the general performance rule is: § § Default number of replay worker == number of CPUs. Scales well up to 8 Limited gain with 16 and 32 no gain or negative gain > 32 16
Example 1 • Looking at the stack for the db 2 redom 0 x 0900000 EEEF 8 thread_wait + 0 x 98 0 x 090000000 C 54108 C sqlo. Wait. EDUWait. Post + 0 x 2 E 0 0 x 090000000 CCA 2 D 20 sqlpr. Wait. During. PRec__FP 8 sqe. Agent. P 16 SQLO_EDUWAITPOST + 0 x. D 8 0 x 090000000 CCA 3 D 80 sqlpr. Get. Wait. Others__FP 8 sqe. Agent. P 10 SQLPR_PRCBPUl + 0 x 1 D 4 0 x 090000000 CCA 208 C sqlpr. Block. Tsp__FP 8 sqe. Agent. P 9 SQLP_DBCBP 20 SQLPR_PR_QUEUE_ENTRYUl. PUl + 0 x 310 0 x 090000000 CC 9 B 268 sqlp. PRec. Read. Log__FP 8 sqe. Agent. P 8 SQLP_ACBP 9 SQLP_DBCB + 0 x 39 DC 0 x 090000000 BE 019 C 4 sqlp. Parallel. Recovery__FP 8 sqe. Agent. P 5 sqlca + 0 x 10 0 x 090000000 C 7844 AC sqle. Sub. Coord. Process. Request__FP 8 sqe. Agent + 0 x 21 C 0 x 090000000 BD 89 E 0 C Run. EDU__8 sqe. Agent. Fv + 0 x 380 0 x 090000000 C 63 E 748 EDUDriver__9 sqz. EDUObj. Fv + 0 x 78 0 x 090000000 C 5 CB 550 sqlo. EDUEntry + 0 x 57 C The db 2 redom is performing a block tablespace operation, but is waiting on the WOCB (tablespace waiter control block) • Setting QLB_CFG_PREC_NUM_WOCB = 2048 in DB 2 BPVARS provided a significant improvement. • Still some congestion was seen in the next round of testing • From the next set of data the log files were analyzed for the specific log record types to identify any blocking records 17
Example 1 • A large number of tablespace blocking operations were seen e. g. in S 0001153. LOG. fmt, below are tbspc blocking: 7898 GET_ONE_EXT //bps 7479 GETPAGE_DP //index 1284 MULTI_PAGE_CREATE_DP //dms 7896 OBJ_ADD_EXT //bps 7479 SPLIT_NODE_BIDI_DP //index • • • GET_ONE_EXT and OBJ_ADD_EXT are used to allocate extent. MULTI_PAGE_CREATE_DP is used to extend an APPEND MODE table GETPAGE_DP and SPLIT_NODE_BIDI_DP are used when the index page is full 18
Example 1 • Hence, we can see there a number of tablespace blocking operations against the same object or a number of objects within the same tablespace. • GET_ONE_EXT, OBJ_ADD_EXT and SPLIT_NODE_BIDI_DP are all impacted by the tablespace extent size. • This is a SAP system, with a default extent size of 2!! • MULTI_PAGE_CREATE_DP, is used for append mode tables. • Overall Resolution • QLB_CFG_PREC_NUM_WOCB = 4096 • Move hot tables to their own tablespace with a larger extent size • Change hot tablespaces to have a larger extent size 19
Example 2 • DB 2 9. 7 HADR congestion slow down on Primary during a particular SAP batch job Database Partition 0 -- Database P 61 -- Standby -- Up 1 days 04: 21: 57 -- Date 05/21/2011 00: 30: 04 HADR Information: Role State Standby Peer Sync. Mode Heart. Beats. Missed Nearsync 0 Log. Gap. Run. Avg (bytes) 254806591 Primary. File Primary. Pg S 0124861. LOG 6420 Primary. LSN 0 x 000016 FD 6 E 854 D 4 C Stand. By. File Stand. By. Pg S 0124860. LOG 9145 Stand. By. LSN Stand. By. Rcv. Buf. Used 0 x 000016 FD 5 F 42 C 1 DE 100% db 2 redow: Most of the time redow spent on sqlpr. Find. Queue(). db 2 redow will sleep when we did not find any queue to work on, but HADR stby buffer is full - 100%!! 0 x 0900000 EDE 58 thread_wait + 0 x 98 0 x 090000000 A 614 A 7 C sqlo. Wait. EDUWait. Post + 0 x 54 0 x 0900000008 E 29558 sqlpr. Wait. During. PRec__FP 8 sqe. Agent. P 16 SQLO_EDUWAITPOST + 0 x. D 0 0 x 0900000008 E 29178 sqlpr. Find. Queue__FP 8 sqe. Agent. P 9 SQLP_DBCBUl. T 3 PUl + 0 x 674 0 x 0900000008 E 281 D 4 sqlp. PRec. Proc. Log__FP 8 sqe. Agent. P 8 SQLP_ACBP 9 SQLP_DBCB + 0 x 548 0 x 0900000008 E 20 A 94 sqlp. Parallel. Recovery__FP 8 sqe. Agent. P 5 sqlca + 0 x. C 2 C 0 x 090000000631 F 67 C sqle. Sub. Coord. Process. Request__FP 8 sqe. Agent + 0 x 208 0 x 0900000005 D 9536 C Run. EDU__8 sqe. Agent. Fv + 0 x 3 B 8 0 x 090000000 A 4 B 38 D 4 EDUDriver__9 sqz. EDUObj. Fv + 0 x. D 8 0 x 090000000 A 51 FBC 0 sqlo. EDUEntry + 0 x 260 20
Example 2 • db 2 redom was on sqlp. PRec. Read. Log->sqlpr. Wait. During. PRec() for more than 5 min (accorging to the stacks). Per source code, redom was processing a db-wide blocking record. 0 x 0900000 EDE 58 0 x 090000000 A 614 A 7 C 0 x 0900000008 E 29558 0 x 0900000008 E 2412 C 0 x 0900000008 E 8 F 928 0 x 090000000631 F 67 C 0 x 0900000005 D 9536 C 0 x 090000000 A 4 B 38 D 4 0 x 090000000 A 51 FBC 0 thread_wait + 0 x 98 sqlo. Wait. EDUWait. Post + 0 x 54 sqlpr. Wait. During. PRec__FP 8 sqe. Agent. P 16 SQLO_EDUWAITPOST + 0 x. D 0 sqlp. PRec. Read. Log__FP 8 sqe. Agent. P 8 SQLP_ACBP 9 SQLP_DBCB + 0 x 1 AC 8 sqlp. Parallel. Recovery__FP 8 sqe. Agent. P 5 sqlca + 0 x 14 sqle. Sub. Coord. Process. Request__FP 8 sqe. Agent + 0 x 208 Run. EDU__8 sqe. Agent. Fv + 0 x 3 B 8 EDUDriver__9 sqz. EDUObj. Fv + 0 x. D 8 sqlo. EDUEntry + 0 x 260 vim +6385 /vbs/engn/sqp/sqlprfwd. C 6383: if (wait. For. Max. Lsn) 6384: { <<<< 6385: rc = sqlpr. Wait. During. PRec(agt. CB, &prcb->q. CB. reader. Wait); >>>> • Log summary of one of the congestion periods: between 23: 30 and 23: 45, 31 generated on primary and all consumed on standby <<== 0% Stand. By. Rcv. Buf. Used between 23: 45 and 00: 00, 32 generated on primary and all consumed on standby <<== 0% Stand. By. Rcv. Buf. Used between 00: 00 and 00: 15, 26 generated on primary and 25 consumed on standby <<== 100% Stand. By. Rcv. Buf. Used between 00: 15 and 00: 30, 26 generated on primary and 27 consumed on standby <<== STILL 100% Stand. By. Rcv. Buf. Used between 00: 30 and 00: 45, 25 generated on primary and 25 consumed on standby <<== 89% Stand. By. Rcv. Buf. Usedbetween 00: 45 and 01: 00, 24 generated on primary and 25 consumed on standby <<== 0% Stand. By. Rcv. Buf. Used 21
Example 2 During the period 0. 00 to 0. 15 these log records were seen: LGOP_BSEGFREED_DP LGOP_BSEGALLOC_DP LGOP_LFDATA_DP SQLI_LRT_IP_UPDATE_ROOTVCTR SQLI_LRT_CREATE_INX SQLD_LRT_ADD_IXCB_NON_CREATE SQLI_LRT_CREATE_PAGE SQLDX_LOGOP_BSEGOP SQLD_LRT_CHANGELIFELSN SQLD_LRT_TRUNC_TABLE From the log file the truncate operations were at 2011 -05 -21 -05. 09. 04 and 05. 09. 05, these are in GMT. So GMT -5 gives 00. 09. 04 and 00. 09. 05 124827. summary: 0 B = 11 : SQLD_LRT_TRUNC_TABLE 05. 09. 04 124827. summary: 0 B = 11 : SQLD_LRT_TRUNC_TABLE 05. 09. 05 253413 LSN = 00001718303 E 4 CEC 18 2011 -05 -21 - 269475 LSN = 000017183075 EE 96 18 2011 -05 -21 - The following operations are DB blocking SQLD_LRT_TRUNC_TABLE, SQLD_LRT_CHANGELIFELSN, SQLD_LRT_DEL_IXCB, SQLD_LRT_ADD_IXCB_NON_CREATE SAP batch job performs truncate table and drop index operations!! 22
Introduction • The operations SQLD_LRT_DEL_IXCB, SQLD_LRT_ADD_IXCB_NON_CREATE were changed from DB blocking to tablespace blocking. (APAR IC 79826) DB 2_HADR_BUF_SIZE was set to 15 x times LOGBUFSIZ or the same size of one log file!! • So when the standby gets slowed down on a log record, there is little capacity to handle the SAP batch job!! • Overall Resolution • Increase DB 2_HADR_BUF_SIZE to 4 x logfilsiz • Apply APAR (IC 79826) that was fixed as part of this PMR When drop index operations are replayed on standby, they slow down redo threads such that standby buffer gets full and causes HADR congestion, which will, in turn, block commits on the HADR primary side. • Change SAP workload 23
Summary • Key data to collect (several iterations of) • DB 2 • • • db 2 pd –stacks -edus db 2 pd –dpsprcb –dbpdbcb db 2 trc –t –Madd hdr –Madd sqpr log files from during the slowdown db 2 support If needed use HADRTrigger. Primary. pl and HADRTrigger. Standby. pl to trigger the data collection on the Primary and Standby • OS • • netstat (-s, –D) vmstat iostat (-d, –D) ps (-elf, aux) 24
Questions ? 25
Appendix • WOCB is a 64 -byte memory control block which is used by redo workers for synchronization purpose. it is allocated by redo master, only for those log records which require some types of synchronization in log replay, e. g. tablespace level synchronization, page level synchronization. The default # of WOCB is 128. if WOCBs are exhausted, redom has to wait for redo worker to free some WOCB. increasing # of WOCB decreases the possiblity of blocking of redo master, so it could help improve redo performance. There is no disadvantage to set this parameter big except the additional memory consumption in the dbheap. Increase WCOB can give good performance when there may be no so many tbspc blocking log records, so we can workaround this section of tough workload by increasing WOCB. it just makes redo master be able to put more tbspc blocking log records on the queues, so that it could not be blocked by allocation of WOCB. - it does not make redow run faster. WOCB cannot help when there many tablespace blocking log records, especially when they are against the same tablespace. 26
Slide Title (can be up to two lines of text) • First Topic • Second Topic • Sub-Topic A • Secondary Sub-Topic 1 • Secondary Sub-Topic 2 • Sub-Topic B • Sub-Topic C • Third Topic • Fourth Topic 27
Insert Name Here Insert Company Name Here Insert E-mail here Session Title Please fill out your session evaluation before leaving!
- Slides: 28