DATABASE MYSTERIES BATCH JOB DIAGNOSTICS Chris Lawson May

DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS Chris Lawson May 4, 2007 "It is the brain, the little gray cells on which one must rely. ” -- Hercule Poirot

Our Agenda for Today • A fun trivia quiz • Overview: different types of diagnostics • Preparing for Oracle Statspack • Running Statspack reports • Limitations of Statspack “Use the • Some helpful Sql scripts cells little gray Hastings” • Case study: Diagnosing “eco_out” batch job. • Awarding of prize for correct trivia quiz answer And now, the Trivia Quiz. . .

TRIVIA QUIZ Oracle’s Statspack does not show … • • • Snapshot start/end times High disk i/o Sql High # of execution Sql High # of parse Sql Init. ora parameters “I “ITL” locking events Top 5 wait events Database privilege problems Database memory statistics say, Poirot!” Vote now--prize at end of presentation!

Here are 2 Kinds of Diagnostics that identify problems: • Simple run-time summaries, provided by application, show runtime patterns. • Graphs show server load or disk i/o over time. Diagnostics that help resolve problems: • Oracle-specific utilities, such as Statspack, that identify problem areas. • Custom scripts that identify long-running transactions & poorly tuned Sql.

Run-Time Summaries • Most batch job/data warehouse jobs provide runtime stats. • Some even show the runtimes for each step. • Bill. Pay lists batch job stats in table Wfjob_Ctl: SQL> desc Wfjob_Ctl Name Null? Type --------------BATCH_DATE NOT NULL VARCHAR 2(8) JOB_NAME NOT NULL VARCHAR 2(40) JOB_RUN_NUMBER NOT NULL NUMBER JOB_STATUS VARCHAR 2(2) JOB_RETURN_CODE NUMBER JOB_START_TIME DATE JOB_END_TIME DATE HOSTNAME VARCHAR 2(20)

Sample Diagnostic: Find All Slow Batch Jobs Show jobs in past ½ day running > 20 mins: Select Job_name, To_char(job_start_time, 'mon-dd Hh 24: mi') STTME, To_char(job_end_time, 'mon-dd Hh 24: mi') ENDTME, Round((job_end_time-job_start_time)*24*60) MINS From Wfjob_ctl Where Job_start_time > Sysdate -. 5 And (Job_end_time-job_start_time)*24*60 > 20 Order By 2

Find Slow Batch Jobs List jobs in past ½ day running longer than 20 mins: JOB_NAME ---------------BP. westbp. gen_dailyrun. C BP. westbp. sam_gen_tkconf. C BP. westbp. prc_instalerts. C BP. westbp. cfin_prc_svcrs 2. C STTME ------dec-18 04: 34 dec-18 06: 00 dec-18 06: 01 dec-18 08: 11 ENDTME MINS ------dec-18 06: 08 94 dec-18 06: 36 36 dec-18 06: 33 32 dec-18 08: 31 20

Sample Report: Find ecoout Runtimes for ecoout job for last 3 days: select job_name, to_char(job_start_time, 'DAY') wekday, TO_CHAR(job_start_time, 'MON-DD hh 24: mi') STTME, TO_CHAR(job_end_time, 'MON-DD hh 24: mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 3 and job_name like '%ecoout_gen_pmt%' order by job_start_time JOB_NAME WEKDAY STTME ENDTME MINS ------------ ------ ---BP. westbp. ecoout_gen_pmt SATURDAY DEC-16 08: 01 DEC-16 08: 48 48

Find alert Runtimes for all alerts job for last 2 days: select job_name, TO_CHAR(job_start_time, 'MON-DD hh 24: mi') STTME, TO_CHAR(job_end_time, 'MON-DD hh 24: mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 2 and job_name like '%alrt%' order by job_start_time JOB_NAME --------------BP. westbp. alrt_prc_daily. C BP. westbp. alrt_cln_pmt. C BP. westbp. alrt_cln_daily. C BP. westbp. alrt 2_gen_pmtmd. C BP. westbp. alrt 2_gen_ebnr. C STTME ------DEC-16 10: 43 DEC-16 11: 39 DEC-18 03: 31 ENDTME MINS ------DEC-16 11: 39 56 DEC-16 11: 44 5 DEC-16 13: 08 89 DEC-18 03: 46 16 DEC-18 03: 38 7

Another Type of Diagnostic: Server Load Charts • This is a good way to spot trends and can help in capacity planning. • We’ve used these graphs to launch investigation after odd spikes. • Server load graphs help isolate the time period having trouble. • This method is typically a visual inspection.

Problem Identification: Server Load Graphs

Spikes Help Identify Problem

Bill-Pay Late January Problem Root cause: CSA users began running up to 8 bad SQL simultaneously

Introduction to Oracle Statspack • Statspack is an Oracle utility that lists performance statistics for a given database. This is called a “Statspack Report. ” • You define the time period of interest by specifying the start-time and end-time. • The stats come from taking regular “snapshots” of the database--typically once per hour (changeable. )

Oracle Statspack: Snaps • The snaps taken are listed in Oracle table Stats$Snapshot • Here’s how to find the snaps taken in last day: Select Snap_id, To_char(snap_time, 'dd-mon-yy-hh 24: mi') Snaptime From Stats$snapshot Where Snap_time > Sysdate 1 Order By 1 SNAP_ID -------4237 4238 4239 * SNAPTIME -------21 -dec-06 -02: 49 21 -dec-06 -03: 49 21 -dec-06 -04: 49 * *

Starting a Statspack Report SQL> @spreport Specify the Begin and End Snapshot Ids ~~~~~~~~~~~~~~~~~~~ Enter value for begin_snap: 1001 Enter value for end_snap: 1002 Specify the Report Name ~~~~~~~~~~~~ The default report file name is sp_1001_1002. To use this name, press <return> to continue, otherwise enter an alternative. Enter value for report_name: TESTREPORT

Statspack Key Sections • • Top 5 resource-consuming events All events causing waits Top SQL ordered by logical i/o Top SQL ordered by physical i/o Top SQL ordered by # of executions Aggregate db statistics Most-read objects Most-locked objects

Statspack: Top 5 Events Top 5 Timed Events ~~~~~~~~~ Single-block reads Event Waits --------------db file sequential read 6, 672, 182 CPU time db file scattered read 2, 611, 855 log file sync 63, 184 ARCH wait on SENDREQ 261 Multi-block reads Write of transaction after Commit % Total Time (s) Ela Time -------25, 852 62. 21 11, 887 28. 61 2, 831 6. 81 208. 50 199. 48

Statspack: List of “Top Sql” • If you know the time period, Statspack will usually zero-in on the problem. • Shows “resource hogs” that consume lots of disk, logical reads, physical reads, or huge number of executions. • Nearly all bad Sql will be detected—but a few can slip-through undetected! • E. g. , what if user is blocked for a long time? That consumes neither disk or much logical i/o, so it’s missed. • More exceptions later.

Statspack: Top Sql Example Top resource consumer—logical reads CPU Elapsed Buffer Gets Executions Gets per Exec %Total Time (s) Hash Value ---------------- ---------195, 093, 132 109, 303 1, 784. 9 17. 5 8750. 55 97170. 32 3351508081 select cur 1. pae_id, cur 1. mpa_nickname, nvl(cur 1. mpasta_id, -1) mpa_status, cur 3. pmt_start_date last_pay_date, cur 3. pmt_amt_complete, cur 2. pmt_start_date schedule_date, * * * Top resource consumer—# of executions CPU per Elap per Executions Rows Processed Rows per Exec (s) Hash Value --------------- ---------589, 109 589, 066 1. 0 0. 00 3012166360 Module: p_direct_check_inbound_pkg INSERT INTO CBPAYMENT_INSTRUCTION_AUDIT (PMTINSTR_ID, PMT_ID, IN STR_ID, MEM_ID, PAE_ID, MPA_ID, INSTRSTA_ID, PMTINSTRBAT_ID, WF_ * * *

Another type of Report: sprepsql • This report shows you actual execution plan for a particular Sql run previously. • You first identify the Sql by its “hash value, ” shown in the first statspack report: Buffer Gets Executions Gets per Exec %Total Time (s) Hash Value ---------------- ---------22, 988, 572 1, 530 15, 025. 2 4. 1 1939. 33 1957. 49 178837896 UPDATE CBNOTICE_MEMBER SET NTCSTA_ID = 3, NTCMEM_MODIFIER_ID = : 1, NTCMEM_MODIFIED_DATE = : 2 WHERE NTC_ID = : 3 AND MEM_ID = : 4 Note this value

sprepsql: Sample Output for an Update Statement Buffer Gets: Disk Reads: Rows processed: CPU Time(s/ms): Elapsed Time(s/ms): Statement Total -------20, 474, 697 19, 895, 404 4, 401 3, 808 6, 369 Per Execute -------20, 474, 697. 0 19, 895, 404. 0 4, 401. 0 3, 807, 950. 0 6, 369, 227. 7 Total -----3. 17 22. 28 UPDATE MEMBER_ACTIVATION MA SET PMT_ACTIVATION_INDICATOR = : B 2 , PMT_ACTIVATION_DATE = SYSDATE , MODIFIED_BY = : B 1 , MODIFIED_DATE = SYSDATE WHERE MA. PMT_ACTIVATION_INDICATOR IS NULL AND EXISTS (SELECT NULL FROM * * * ----------------------------------------| Operation | PHV/Object Name | Rows | Bytes| Cost | ----------------------------------------|UPDATE STATEMENT |----- 1119347130 ----| | | 529076 | |UPDATE | | | HASH JOIN SEMI | | 925 K| 24 M| 529076 | | TABLE ACCESS FULL |MEMBER_ACTIVATION | 954 K| 18 M| 732 | | TABLE ACCESS FULL |CBPAYMENT | 92 M| 705 M| 499796 | ----------------------------------------

Statspack Good Practices: Tips & Traps • Old Snapshots are occasionally purged, so you can’t go back forever! • Certain problems can fly “under the radar. ” • Example: Statspack lists resource-intensive Sql. But what if the problem is due to the cumulative effect of millions of similar (not identical) Sql? Statspack will miss this because each Sql is below the reporting threshold.

Statspack Good Practices: Tips & Traps Here’s one that most of us missed: Oracle’s statistics are reset if table stats are gathered, so Statspack will rate that Sql as perfect! SECS ----2 EXEC ---2 Ha! I knew that! BUFFER_GETS DISK_READS SQL_TEXT ---------------------10694 10683 select count(*) from chris_waivers After Analyze of “Chris Waivers” SECS ----0 EXEC ---0 BUFFER_GETS DISK_READS SQL_TEXT ---------------------0 0 select count(*) from chris_waivers

Statspack Good Practices: Summary • Statspack is an excellent diagnostic aid in identifying possible bottlenecks—not solving them. • Statspack usually uncovers the resource drivers for a specified period of time. • Remember, however, it will not find everything --a few things can slip by. • Like any tool, it can’t fix the design for you.

My Favorite Diagnostic Scripts I use custom scripts far more often than Statspack. Here are some favorites: 1. 2. 3. 4. 5. Show all Sql running on database. Show the object a user is waiting on. List cumulative wait events for a user. List resource-intensive Sql run recently. Show average speed of disk i/o.

Script 1: Show all Active Sql • This gives you an idea of some of the critical Sql for an application. • If you see the same Sql over and over, it either is inefficient, or run extremely often. • On most OLTP systems, only a handful of sessions are usually active at one point— because everyone else has already got their answer.

Example: Show all Active Sql Select DISTINCT Sid, username, substr(sql_text, 1, 200) stext From V$Session, V$Sql Where status = 'ACTIVE' This eliminates Oracle background processes And username is not null And v$session. sql_hash_value = hash_value and v$session. sql_address = v$sql. address and sql_text not like '%sql_text%' This eliminates this query itself SID USERNAME STEXT ---------------------------128 BMAPP SELECT PM_ACCT_NO FROM CBPAYMENT_METHOD WHERE MEM_ID = : 1 FOR UPDATE NOWAIT

Script 2: Show Object User is Waiting on • This is especially helpful in isolating bottlenecks for a particular Sql. • This shows you where your Sql is stuck —you may be surprised! • On FFIEC, this revealed that most queries were always reading one particular table (or its indexes. )

Actual Example: Show Object User is Waiting on Sele. CT DISTINCT username, object_name, sql_text From V$Session, V$Sql, dba_objects o Where v$session. status = 'ACTIVE' And username is not null and o. object_id = row_wait_obj# And v$session. sql_hash_value = hash_value and v$session. sql_address = v$sql. address and username <> 'SYS' Scanning the primary key index USERNAME OBJECT_NAME SQL_TEXT --------------------ADMAPP PMT_PK WITH bills AS (SELECT /*+ Materialize */ bls_external_billid, blr_bspbiller_id, bls. mem_id, bls. mpa_id, bls_id, TO_CHAR (bls. bls_first_seen, 'yyyy. MMdd. HHmiss') bls_first_seen, bls_notify_sent_to_bsp FROM cbbill_s

Script 3: Session Cumulative Wait Events • This summarizes the bottlenecks for the totality of a particular session. • Especially useful for a long-running batch job. • This shows how much time is due to each type of bottleneck. • This helps you avoid solving problems that aren’t really the bottleneck.

Actual Example: Session Cumulative Wait Events Select Sid, Event, Total_waits, time_waited/100 Timewaited, Average_wait/100 Avgwait, Round(100*total_waits/Time_waited) RATE wait times are in From V$session_event centiseconds Where Sid = 339 And Time_waited > 10000 And Event Not Like '%Net%‘ Order By Timewaited SID ---339 339 EVENT TOTAL_WAITS TIMEWAITED AVGWAIT RATE ------------- -----db file scattered read 2912329 534. 3 0 5451 log file sync 138296 795. 71. 01 174 enqueue 2594 6937. 74 2. 67 0 Blocked by another user

Script 4: Resource-Intensive Sql • Pretty similar to the Statspack Sql report. This often highlights serious problems. • Quickly shows you “resource hogs” for a database—big CPU or disk users. • You set a threshold—such as all SQL consuming over 1 million disk reads, or SQL that has run over 10 hours. • This script exemplifies why we don’t need to “guess” at the cause of performance problems. Although finding root cause is usually easy, applying the fix may not be.

Actual Example: Resource-Intensive Sql Select Round(elapsed_time/1000000/1) Secs, Rows_processed Rowct, Executions, Buffer_gets, Disk_reads, Sql_text From V$sql Where Elapsed_time > 9910000000 And Executions > 0 And Abs(buffer_gets)/(Rows_processed+. 01) > 100 Order By Elapsed_time SECS ---82136 ROWCT EXECUTIONS BUFFER_GETS DISK_READS ----------0 20814258 62545409 3814708 select vcrypttrac 0_. "USER_NODE_LOG_ID" as USER 1_, vcrypttrac 0_. "REQUEST_ID" as REQUEST 2_25_, vcrypttrac 0_. "CLIENT_DEVICE_ID" as CLIENT 3_25_, * * * Surprise! Sql doesn’t do anything! (actual production case)

Script 5: Show Disk i/o Rate • On many systems, disk i/o is a significant performance driver. • You want to eliminate the disk i/o first, but it’s still helpful to know roughly how fast you can perform a typical read. • Numbers typically range from 100 -300 singleblock reads per second • Note: Oracle uses the non-intuitive term, “sequential read” for a single read. • Multi-block reads are called “scattered reads. ”

Example: Show Disk I/O Rate for 2 Different Systems Select EVENT, TOTAL_WAITS, TIME_WAITED, Round(100*total_waits/Time_waited) Rate From V$system_event Where Time_waited > 1000 And Event Like '%Db File Sequential Read%' BILLPAY EVENT TOTAL_WAITS TIME_WAITED RATE ------------- -----db file sequential read 7485200484 2059793947 363 FFIEC EVENT TOTAL_WAITS TIME_WAITED RATE ------------ -----db file sequential read 663982417 2481360621 27 Why is disk so slow?

Case Study: Diagnosing “eco” Batch • Called “ecoout_gen_pmt, ” it is critical Billpay batch, starting about 03: 30 daily. • Tight SLA—Especially for “big payment” days. Job typically runs several hours. Business is concerned that SLA is at risk, especially for Tuesday runs.

Eco-Payment Batch Runtimes Querying wfjob_ctl shows historical runtimes and confirms issue: WEKDAY ----TUESDAY STTME ------JUL-11 03: 47 JUL-18 03: 48 JUL-25 03: 41 AUG-01 03: 51 ENDTME MINS ------JUL-11 06: 33 165 JUL-18 09: 48 360 JUL-25 06: 04 143 AUG-01 07: 55 244 SLA failures

Ecoout: Identify Root Cause The “resource-intensive” script finds this Sql running during the time period: UPDATE CBPAYMENT PMT_UPDATE SET (COLS) = (SELECT “stuff” FROM CBMEMBER M, CBMEMBER_PAYEE_ACCOUNT MPA , CBELEMENT E, ETC. Script shows that Sql ran for 3 hours, consuming 25 million logical reads to process 250 k rows This suggests problem with the way Oracle executed this transaction.

Eco-out Root Cause • • • Running “Explain plan” for Sql shows Oracle combines 2 indexes: “Bitmap conversion from ROWID. ” Execution plan shows these indexes Used: Pmtinstr_instrsta_id_idx ? ? Non-Selective! Pmtinstr_idx So Oracle combines two indexes, one of which is a terrible choice. The optimizer should simply do the obvious: just use the one “good” index. BTW: This happened more than one.

Eco-out Root Cause • • • This optimizer tactic is a notorious weakness in Oracle 9 i--some shops turn-off this feature. In some Billpay batch jobs, we issue an “Alter Session” command to disallow this feature. Our fix: Sql hint to specify the desired index: /*+index(I PMTINSTR_INSTRSTA_ID_IDX) */ So what happened after Sql hint applied?

Eco-Payment New Runtimes

CHALLENGE QUESTION Oracle’s Statspack does not show … • • • Snapshot start/end times High disk i/o Sql High # of execution Sql High # of parse Sql Init. ora parameters Database privilege problems Database memory statistics “ITL” locking events Top wait events

QUIZ ANSWER “This is NOT part of Statspack. ” PRIVILEGE PROBLEMS

“Case Closed” : Good Diagnostic Aids • Consider logs that show job runtime • Statspack: – spreport shows resource usage – sprepsql shows execution plan – In Oracle 10 g, use AWR • Custom scripts that show resource usage such as high CPU or disk i/o • Don’t forget server CPU load graphs.

Case Closed: Tips & Traps • There isn’t just “one way” to always identify performance bottlenecks. • Statspack cannot detect everything; some things fly “under the radar. ” • No one has a script to cover every possible problem. • So, develop a “toolkit” of approaches —custom Sql scripts, graphs, tools, etc.

Questions? “The world is full of obvious things which nobody by any chance ever observes. ” Sherlock Holmes, The Hound of the Baskervilles.