VMware Error Examples

Scsi Device IO errors

Where to look for errors

Depending on the version of VMware, SCSI errors are logged in either:

  • /var/log/messages
  • the vmkernel log

The safest way to make sure you are given the correct logs is to request a vm-support dump.

QLogic Confidential Month DD, YYYY

SCSI Errors: SCSI codes and devices

Example error:

vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000001008" failed H:0x8 D:0x0 Possible sense data: 0x5 0x24 0x0.

Note: In some versions of VMware this error will start with FCP.

Breaking the error down:

  • Command 0x28: the type of command the error refers to. In this case 0x28 is a read command. A good reference can be found here: http://en.wikipedia.org/wiki/SCSI_command
  • "naa.6005076307ffc1370000001008": the target device of the command.
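The breakdown above can be automated with a small parser. The following is a minimal sketch, assuming the log line appears in its usual unspaced form (for example `H:0x8 D:0x0`); the function name `parse_scsi_io` and the tiny opcode table are illustrative and not part of any VMware tool.

```python
import re

# Pattern for the ScsiDeviceIO failure line. The P: field is optional
# because some examples in this deck omit it.
SCSI_IO_RE = re.compile(
    r'Command (?P<cmd>0x[0-9a-fA-F]+) to device "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+)'
    r'(?: P:(?P<plugin>0x[0-9a-fA-F]+))?'
    r' (?:Possible|Valid) sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

# Only the two opcodes named in this deck; extend from a SCSI command reference.
OPCODES = {0x12: "INQUIRY", 0x28: "READ(10)"}


def parse_scsi_io(line):
    """Return the fields of a ScsiDeviceIO failure line, or None if no match."""
    m = SCSI_IO_RE.search(line)
    if m is None:
        return None
    f = m.groupdict()
    cmd = int(f["cmd"], 16)
    return {
        "command": cmd,
        "command_name": OPCODES.get(cmd, "unknown"),
        "device": f["device"],
        "host": int(f["host"], 16),
        "device_status": int(f["dev"], 16),
        "plugin": int(f["plugin"], 16) if f["plugin"] else None,
        "sense": (int(f["key"], 16), int(f["asc"], 16), int(f["ascq"], 16)),
    }


SAMPLE = (
    'vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device '
    '"naa.6005076307ffc1370000001008" failed H:0x8 D:0x0 '
    'Possible sense data: 0x5 0x24 0x0.'
)
```

Calling `parse_scsi_io(SAMPLE)` yields the opcode, the device name, the H/D/P status codes and the three sense bytes as integers, ready for lookup in the decoding tables.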

SCSI Errors: (Host/Device/Plugin) codes

Example error:

vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000001008" failed H:0x8 D:0x0 Possible sense data: 0x5 0x24 0x0.

Breaking the error down:

  • H:0x8 D:0x0 P:0x0: defines which layer is reporting the error. H = Host (initiator), D = Device (target), P = Plugin. Host, Device and Plugin codes can be decoded here:
      • Host: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029039
      • Device: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381
      • Plugin: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2004086&sliceId=1&docTypeID=DT_KB_1_1&dialogID=295774258&stateId=0 0 295772714
  • In this error the Host is reporting error 0x8, which indicates the HBA driver has aborted the I/O. It can also occur if the HBA resets the target.

Note: The fact that the Host reported the error DOES NOT mean the Host is the cause. This is a common misconception.
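As a rough illustration of reading the H/D/P triple, the sketch below lists which layers reported a non-zero status. The note tables contain only the codes explained in this deck (Host 0x8 and Device 0x2); the function names are hypothetical, and the full code tables live in the VMware KB articles.

```python
# Notes taken only from this deck; the VMware KB articles hold the full tables.
HOST_NOTES = {0x8: "HBA driver aborted the I/O, or the HBA reset the target"}
DEVICE_NOTES = {0x2: "Check Condition"}


def reporting_layers(host, device, plugin=0):
    """Return the names of the layers whose status code is non-zero."""
    codes = (("host", host), ("device", device), ("plugin", plugin))
    return [name for name, code in codes if code != 0]


def describe(host, device, plugin=0):
    """Return one readable line per reporting layer."""
    notes = {"host": HOST_NOTES, "device": DEVICE_NOTES, "plugin": {}}
    codes = {"host": host, "device": device, "plugin": plugin}
    return [
        f"{name} {codes[name]:#x}: {notes[name].get(codes[name], 'see the KB table')}"
        for name in reporting_layers(host, device, plugin)
    ]
```

For `H:0x8 D:0x0 P:0x0`, `describe(0x8, 0x0, 0x0)` names only the host layer. The reporting layer is where the error surfaced, which is not necessarily where it originated.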

SCSI Errors: Sense codes

Example error:

vmkernel: 8:23:44:19.128 cpu1:4142)ScsiDeviceIO: 1672: Command 0x28 to device "naa.6005076307ffc1370000001008" failed H:0x8 D:0x0 Possible sense data: 0x5 0x24 0x0.

Breaking the error down:

  • 0x5 0x24 0x0: the sense data, in Sense Key, ASC, ASCQ format. Often this is all zeros, but on occasion sense data is provided. A good key for this information can be found here: http://en.wikipedia.org/wiki/Key_Code_Qualifier

For this error it says: Illegal Request, Invalid field in CDB.
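The sense lookup can be sketched as a pair of small tables. The sense-key names below are the standard SCSI ones; the ASC/ASCQ table is deliberately limited to the two combinations that appear in this deck, and `decode_sense` is an illustrative name rather than an existing API.

```python
# Standard SCSI sense keys (subset).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}

# Only the ASC/ASCQ pairs used in this deck; a full table has hundreds of entries.
ASC_ASCQ = {
    (0x24, 0x00): "INVALID FIELD IN CDB",
    (0x04, 0x00): "LOGICAL UNIT NOT READY, CAUSE NOT REPORTABLE",
}


def decode_sense(key, asc, ascq):
    """Turn a (sense key, ASC, ASCQ) triple into readable text."""
    key_text = SENSE_KEYS.get(key, f"unknown sense key {key:#x}")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#04x}/{ascq:#04x} not in table")
    return f"{key_text}: {detail}"
```

With this table, the triple `0x5 0x24 0x0` decodes to an illegal request with an invalid field in the CDB, and `0x2 0x4 0x0` decodes to a not-ready logical unit with no reportable cause.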

Other Useful Links

  • Interpreting sense data in VMware: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=289902&sliceId=2&docTypeID=DT_KB_1_1&dialogID=255380300&stateId=1 0 255382579
  • Common errors that result in ESX failing a LUN over to another path: http://kb.vmware.com/selfservice/documentLinkInt.do?micrositeID=&popup=true&languageId=&externalID=1003433

Sample errors to decode:

1) ScsiDeviceIO: 1672: Command 0x12 to device "eui.00173800049f0000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x0.

  • Command 0x12 = Inquiry
  • eui.00173800049f0000 = Device name
  • D:0x2 = Device (target) reported a Check Condition
  • Sense data 0x2 0x4 0x0 = Not Ready, cause not reportable

The examples below are often seen alongside the Scsi Device IO errors. They indicate the same events, but as reported by the QLogic driver:

Error: vmkernel: 8:23:44:19.128 cpu7:4142)<6>qla2xxx 0000:09:00.0: scsi(8:0:0): Abort command issued -- 1f0784a4 2002.
Meaning: A SCSI command abort has been issued to the target.

Error: vmkernel: qla2xxx 0000:03:00.0: scsi(1:0:1): DEVICE RESET ISSUED.
Meaning: A device (LUN) reset has been issued to the target.

Error: vmkernel: qla2xxx 0000:03:00.0: scsi(1:0:1): DEVICE RESET SUCCEEDED
Meaning: A device (LUN) reset has been successfully processed by the target.

ASYNC errors

Example errors:

Error message: cpu4:4100)scsi(5): Asynchronous PORT UPDATE ignored 0000/0004/0600
Meaning / indication: A fabric disruption occurred.

Error message: scsi(5): Asynchronous LOOP UP (10 Gbps).
Meaning / indication: The loop came up at the noted speed (in this case 10 Gbps).

Error message: scsi(5): Asynchronous LOOP DOWN (10 Gbps).
Meaning / indication: The loop went down at the noted speed (in this case 10 Gbps).

Link status messages

Example errors:

Message: vmkernel: 0:00:21:01.647 cpu2:129) <6>scsi(0): LOOP DOWN detected.
Meaning / indication: The loop is down.

Message: vmkernel: 0:00:21:54.032 cpu2:129) <6>scsi(0): LOOP UP detected.
Meaning / indication: The loop is up.

Message: vmkernel: 0:00:21:26.285 cpu0:139) <6>scsi(0): Cable is unplugged...
Meaning / indication: The physical link is down.
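When working through a long log it can help to measure how long each link outage lasted. The sketch below pairs each LOOP DOWN with the next LOOP UP. It assumes the vmkernel prefix is a host uptime stamp in `days:hours:minutes:seconds.milliseconds` form and, for brevity, ignores which `scsi(N)` adapter logged the message; both the function names and these simplifications are illustrative.

```python
import re

# Assumed prefix format: "vmkernel: D:HH:MM:SS.mmm " (host uptime).
STAMP_RE = re.compile(r'vmkernel: (\d+):(\d+):(\d+):(\d+)\.(\d{3}) ')


def uptime_ms(line):
    """Convert the vmkernel uptime stamp of a log line to milliseconds."""
    m = STAMP_RE.search(line)
    if m is None:
        raise ValueError("no vmkernel uptime stamp in line")
    days, hours, minutes, seconds, millis = map(int, m.groups())
    return (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * 1000 + millis


def loop_outages_ms(lines):
    """Return the duration in ms of each LOOP DOWN to LOOP UP interval."""
    outages = []
    down_at = None
    for line in lines:
        if "LOOP DOWN detected" in line:
            if down_at is None:  # keep the earliest DOWN of an outage
                down_at = uptime_ms(line)
        elif "LOOP UP detected" in line and down_at is not None:
            outages.append(uptime_ms(line) - down_at)
            down_at = None
    return outages


SAMPLE_LINES = [
    "vmkernel: 0:00:21:01.647 cpu2:129) <6>scsi(0): LOOP DOWN detected.",
    "vmkernel: 0:00:21:26.285 cpu0:139) <6>scsi(0): Cable is unplugged...",
    "vmkernel: 0:00:21:54.032 cpu2:129) <6>scsi(0): LOOP UP detected.",
]
```

For the three sample messages, `loop_outages_ms(SAMPLE_LINES)` reports a single outage of 52385 ms, that is, about 52 seconds between the loop going down and coming back up.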

VMFS Heartbeat errors

VMFS Heartbeat errors

Example error:

vobd: Mar 01 13:24:16.429: 776658042771us: [esx.problem.vmfs.heartbeat.timedout] 4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.

Breaking the error down:

  • esx.problem.vmfs.heartbeat.timedout = The ESX host's connectivity to the volume degraded because the host was unable to renew its heartbeat for a period of approximately 16 seconds (the VMFS lock-breaking lease timeout).
  • After the periodic heartbeat renewal fails, VMFS declares that the heartbeat to the volume has timed out and suspends all I/O activity on the device until connectivity is restored or the device is declared inoperable.
  • 4f44b9b5-5c051bb1-12a0-001018XXXXXX = The UUID of the volume the error refers to.
  • VT-315-CLU-XXXXXX = The name of the volume the error refers to.
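The fields described above can be pulled out of a vobd line with one regular expression. This is a minimal sketch: it assumes the volume name contains no spaces and that the line ends with the volume name followed by an optional period, as in the example; `parse_heartbeat_event` is an illustrative name.

```python
import re

VOBD_HEARTBEAT_RE = re.compile(
    r'\[(?P<event>esx\.problem\.vmfs\.heartbeat\.[a-z]+)\]\s+'
    r'(?P<uuid>[0-9A-Za-z-]+)\s+'
    r'(?P<volume>\S+?)\.?\s*$'  # volume name, minus the trailing period
)


def parse_heartbeat_event(line):
    """Return event id, volume UUID and volume name, or None if no match."""
    m = VOBD_HEARTBEAT_RE.search(line)
    return m.groupdict() if m else None


SAMPLE = (
    'vobd: Mar 01 13:24:16.429: 776658042771us: '
    '[esx.problem.vmfs.heartbeat.timedout] '
    '4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.'
)
```

The returned dictionary has the keys `event`, `uuid` and `volume`, which makes it easy to group heartbeat events per volume.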

Link failure and recovery

These errors are typically followed by messages confirming loss of the connection, like the one below:

  • Hostd: [2012-03-01 13:24:16.429 FFEF8B90 info 'ha-eventmgr'] Event 73 : Lost access to volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Once the link recovers, a similar set of messages will be logged:

  • Mar 1 13:24:21 vmkernel: 8:23:44:24.620 cpu2:4142)FS3: 398: Reclaimed heartbeat for volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX): [Timeout] [HB state abcdef02 offset 4063232 gen 5 stamp 776664617984 uuid 4f439f16-eb47d490-548c-00215e5de25c jrnl <FB 29008> drv 8
  • Mar 1 13:24:21 vobd: Mar 01 13:24:21.922: 776663535861us: [esx.problem.vmfs.heartbeat.recovered] 4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.
  • Mar 1 13:24:21 Hostd: [2012-03-01 13:24:21.922 32EAFB90 info 'ha-eventmgr'] Event 74 : Successfully restored access to volume 4f44b9b5-5c051bb1-12a0-001018XXXXXX (VT-315-CLU-XXXXXX) following connectivity issues.
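To see how long a volume was inaccessible, the timed-out and recovered vobd events can be paired per volume UUID. The sketch below assumes the number before `us:` in a vobd line is a monotonically increasing microsecond counter; that reading is an assumption based on the two example lines, and `heartbeat_outages_us` is an illustrative name.

```python
import re

HB_RE = re.compile(
    r'(?P<us>\d+)us: '
    r'\[esx\.problem\.vmfs\.heartbeat\.(?P<kind>timedout|recovered)\] '
    r'(?P<uuid>\S+)'
)


def heartbeat_outages_us(lines):
    """Return (uuid, duration in microseconds) for each timedout/recovered pair."""
    pending = {}  # uuid -> counter value of the first unmatched timeout
    outages = []
    for line in lines:
        m = HB_RE.search(line)
        if m is None:
            continue
        uuid, stamp = m.group("uuid"), int(m.group("us"))
        if m.group("kind") == "timedout":
            pending.setdefault(uuid, stamp)
        elif uuid in pending:
            outages.append((uuid, stamp - pending.pop(uuid)))
    return outages


SAMPLE_LINES = [
    'vobd: Mar 01 13:24:16.429: 776658042771us: '
    '[esx.problem.vmfs.heartbeat.timedout] '
    '4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.',
    'Mar 1 13:24:21 vobd: Mar 01 13:24:21.922: 776663535861us: '
    '[esx.problem.vmfs.heartbeat.recovered] '
    '4f44b9b5-5c051bb1-12a0-001018XXXXXX VT-315-CLU-XXXXXX.',
]
```

For the two example events the result is 5493090 microseconds, which agrees with the roughly 5.5 seconds between the wall-clock stamps 13:24:16.429 and 13:24:21.922.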

Misc. Log Messages

Log messages and meanings

NOTE: These messages may be slightly different in newer versions of VMware.

Message: "qla2x00_set_info starts at address = xxxx"
Meaning: Driver is reporting the starting address where the driver was loaded, in case an oops occurs in the driver.

Message: "qla2x00: Found VID=xxxx DID=yyyy SSVID=zzzz SSDID=vvvv"
Meaning: Driver is reporting which adapter it has found during initialization.

Message: "scsi(%d): Allocated xxxxx SRB(s)"
Meaning: Driver is reporting the number of simultaneous commands that can be executed by the adapter. The max_srbs option can change this number.

Message: "scsi(%d): 64 Bit PCI Addressing Enabled"
Meaning: Driver is reporting that it has configured the adapter for 64-bit PCI bus transfers.

Message: "scsi(%d): Verifying loaded RISC code..."
Meaning: Driver is reporting that it has verified the RISC code and it is running.

Message: "scsi(%d): Verifying chip..."
Meaning: Driver is reporting that it has verified the chip on the adapter.

Message: "scsi(%d): Waiting for LIP to complete..."
Meaning: Driver is reporting that it is waiting on the firmware to become ready.

Message: "scsi(%d): LIP occurred, ..."
Meaning: Driver received a LIP async event from the firmware.

Message: "scsi(%d) LOOP UP detected"
Meaning: Driver received a loop up async event from the firmware.

Message: "scsi(%d) LOOP DOWN detected"
Meaning: Driver received a loop down async event from the firmware.

Message: "scsi(%d): Link node is up"
Meaning: Driver received a point-to-point async event from the firmware.

Message: "scsi%d: Topology - (%s), Host Loop address 0x0"
Meaning: Indicates the firmware connection type. %s will be one of the following: FL-PORT, N-PORT, F-PORT, NL-PORT, and host adapter loop ID.

Initialization messages and meanings (cont.)

Message: "scsi%d : QLogic XXXXXX PCI to Fibre Channel Host Adapter: ... Firmware version: 4.06, Driver version 7.08vm62"
Meaning: Driver is reporting information discovered during its initialization. This information includes the board ID, firmware version, and driver version.

Message: "qla%d Loop Down - aborting ISP"
Meaning: Indicates the driver is attempting to restart the loop by resetting the adapter. Usually done by the driver when sync is not detected by the firmware for a long time (4+ minutes), and usually means that the adapter port is not connected to the switch or loop.

Message: "scsi(%d): %s asynchronous Reset."
Meaning: Driver received an async reset event from the firmware. %s indicates the function name.

Message: "qla2x00: ISP System Error - mbx1=%x, mbx2=%x, mbx3=%x"
Meaning: Driver received an async ISP system error event from the firmware. Additional information follows the message (that is, mailbox values from the firmware).

Message: "scsi(%d): Configuration change detected: value %d."
Meaning: Driver received a change in connection async event from the firmware. Additional information follows the message (that is, mailbox 1 value from the firmware).

Message: "scsi(%d): Port database changed"
Meaning: Driver received a port database async event from the firmware.

Message: "scsi(%d): RSCN, ..."
Meaning: Driver received a registered state change notification (RSCN) async event from the firmware. Additional information follows the message (that is, mailbox values from the firmware).

Message: "%s: Can't find adapter for host number %d\n"
Meaning: Indicates that the read from /proc/scsi/qla2X00 did not specify the correct adapter host number. %s indicates the function name.

Message: "scsi(%d): Cannot get topology - retrying"
Meaning: Firmware return status indicating it is busy.

Message: "%s(): **** SP->ref_count not zero\n"
Meaning: Indicates a coding error. %s is the function name.

Initialization messages and meanings (cont.)

Message: "qla_cmd_timeout: State indicates it is with ISP, But not in active array"
Meaning: Indicates a coding error. %s is the function.

Message: "cmd_timeout: LOST command state = 0x%x\n"
Meaning: Indicates the command is in an undefined state. 0x%x indicates the state number.

Message: "qla2x00: Status Entry invalid handle"
Meaning: Driver detected an invalid entry in the ISP response queue from the firmware. %x indicates the queue index.

Message: "%s(): **** CMD derives a NULL TGT_Q\n"
Meaning: Indicates the command does not point to an OS target.

Message: "scsi(%ld:%d:%d): DEVICE RESET ISSUED.\n"
Meaning: Indicates a device reset is being issued to (host:bus:target:lun).

Message: "scsi(%ld:%d:%d): LOOP RESET ISSUED.\n"
Meaning: Indicates a loop reset is being issued to (host:bus:target:lun).

Message: "%s(): **** CMD derives a NULL HA\n" or "%s(): **** CMD derives a NULL search HA\n"
Meaning: Indicates the command does not point to the adapter structure.

Message: "scsi(%ld:%d:%d): now issue ADAPTER RESET.\n"
Meaning: Indicates an adapter reset is being issued to (host:bus:target:lun).

Message: "scsi(%d): Unknown status detected %x-%x"
Meaning: Indicates the status returned from the firmware is not supported. %x-%x is the completion-scsi statuses.

Message: "scsi(%ld:%d:%d): Enabled tagged queuing, queue depth %d.\n"
Meaning: Indicates the queue depth for the (host:bus:target:lun).

Message: "PCI cache line size set incorrectly (%d bytes) by BIOS/FW, "
Meaning: Indicates a correction in the cache size. %d is the cache size.

Message: "scsi(%d): Cable is unplugged..."
Meaning: Indicates the firmware state is in LOSS OF SYNC; therefore, the cable must be missing.

Initialization messages and meanings (cont.)

Message: "qla2x00: Performing ISP error recovery - ha=%p."
Meaning: Indicates the driver has started performing an adapter reset.

Message: "qla2x00_abort_isp(%d): **** FAILED ****"
Meaning: Indicates the driver failed performing an adapter reset.

Message: "%s(%ld): RISC paused, dumping HCCR (%x) and schedule an ISP abort (big-hammer)\n"
Meaning: Indicates the driver has detected the RISC in the pause state.

Message: "scsi(%ld): Mid-layer underflow detected (%x of %x bytes) wanted %x bytes...returning DID_ERROR status!\n"
Meaning: Indicates an underflow was detected.

Message: "%s(): Ran out of paths - pid %d"
Meaning: Indicates there are no more paths to try for the request. %s is the function name and %d is the mid-level processor identifier (PID).

Message: "WARNING %s(%d): ERROR Get host loop ID"
Meaning: Firmware failed to return the adapter loop ID.

Message: "WARNING qla2x00: couldn't register with scsi layer\n"
Meaning: Indicates the driver could not register with the SCSI layer, usually because it could not allocate the memory required for the adapter.

Message: "WARNING scsi(%d): [ERROR] Failed to allocate memory for adapter\n"
Meaning: Indicates the driver could not allocate all the kernel memory it needed.

Message: "WARNING qla2x00: Failed to initialize adapter\n"
Meaning: Indicates that a previously occurring error is preventing the adapter instance from initializing normally.

Message: "WARNING scsi%d: Failed to register resources.\n"
Meaning: Indicates the driver could not register with the kernel.

Message: "WARNING qla2x00: Failed to reserve interrupt %d already in use\n"
Meaning: Indicates the driver could not register for the interrupt IRQ because another driver is using it.

Initialization messages and meanings (cont.)

Message: "WARNING qla2x00: ISP Request Transfer Error"
Meaning: Driver received a Request Transfer Error async event from the firmware.

Message: "WARNING qla2100: ISP Response Transfer Error"
Meaning: Driver received a Response Transfer Error async event from the firmware.

Message: "WARNING Error entry invalid handle"
Meaning: Driver detected an invalid entry in the ISP response queue from the firmware. This error will cause an ISP reset to occur.

Message: "WARNING scsi%d: MS entry - invalid handle"
Meaning: Driver detected a management server command timeout.
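Because the driver messages in these tables are printf-style templates, one way to look up a log line is to turn each template into a regular expression. The sketch below does this for a few of the templates listed above; the placeholder patterns, the `MESSAGES` subset and the function names are illustrative assumptions, and the trailing `\n` of the templates is left out because log lines do not carry it.

```python
import re

# Regex fragment assumed for each printf conversion used in the tables.
SPEC_PATTERNS = {
    "%d": r"-?\d+",
    "%ld": r"-?\d+",
    "%x": r"[0-9a-fA-F]+",
    "%s": r"\S+",
    "%p": r"(?:0x)?[0-9a-fA-F]+",
}
SPEC_RE = re.compile(r"%ld|%[dxsp]")


def template_to_regex(template):
    """Escape the literal text and substitute a pattern for each conversion."""
    parts = []
    pos = 0
    for m in SPEC_RE.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        parts.append(SPEC_PATTERNS[m.group(0)])
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


# A small subset of the message tables, for illustration only.
MESSAGES = {
    "scsi(%d): Cable is unplugged...":
        "Firmware state is LOSS OF SYNC; the cable must be missing.",
    "scsi(%ld:%d:%d): DEVICE RESET ISSUED.":
        "A device reset is being issued.",
    "scsi(%d): Port database changed":
        "Driver received a port database async event from the firmware.",
}


def explain(line):
    """Return the meaning of the first template that matches the line."""
    for template, meaning in MESSAGES.items():
        if template_to_regex(template).search(line):
            return meaning
    return None
```

A line such as `vmkernel: qla2xxx 0000:03:00.0: scsi(1:0:1): DEVICE RESET ISSUED.` then maps to its table entry, while lines that match no template return `None`. Exact wording can differ between driver versions, so a miss does not prove the message is undocumented.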

Log messages valid only for ISP82xx

Message: qlcnic 0000:0d:00.1: PEG_HALT_STATUS1: 0x0, PEG_HALT_STATUS2: 0x0.
Meaning: Indicates that there has been a peg fault. A driver reset should follow this.

Message: qla2xxx 0000:0d:00.7: HW State: FAILED
         qla2xxx 0000:0d:00.7: Disabling the board
Meaning: Firmware has fatally failed and the board will now be disabled.

Message: qla2xxx 0000:0d:00.7: qla2xxx: RESET TIMEOUT! drv_state= 0x4 drv_active=0x6
Meaning: The timeout for the successful completion of the firmware reset has occurred.

Message: qla2xxx: Initialization TIMEOUT!
Meaning: The timeout for the successful completion of the firmware initialization has occurred.

Message: qla2xxx 0000:0d:00.7: HW State: QUIESCENT
Meaning: Firmware has been put into the quiescent state.

Message: qla2xxx 0000:0d:00.7: HW State: READY
Meaning: Firmware has been initialized properly and is now ready to be used.

Message: qla2xxx 0000:0d:00.7: HW State: INITIALIZING
Meaning: The hardware state has changed to initializing and the device is now being initialized. If properly initialized, the device state should change to READY.

Message: qla2xxx 0000:0d:00.7: qla2xxx: QUIESCENT TIMEOUT! drv_state= 0x4 drv_active=0x6
Meaning: The timeout for the firmware to enter quiescent mode has occurred.

Message: qla2xxx 0000:0d:00.7: HW State: NEED RESET
         qla2xxx 0000:0d:00.7: qla82xx_abort_isp(4): reset_owner is 0x7
Meaning: The driver needs to reset the hardware, and the owner of the reset operation is 0x7.
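When several `HW State` messages appear for the same adapter, the most recent one tells you where the hardware ended up. The sketch below keeps the last reported state per PCI function. It assumes the message layout shown in the table (`qla2xxx <pci address>: HW State: <STATE>` at the end of the line); the function name and the sample sequence are illustrative.

```python
import re

HW_STATE_RE = re.compile(
    r'qla2xxx (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): '
    r'HW State: (?P<state>[A-Z ]+?)\s*$'
)


def last_hw_states(lines):
    """Map each PCI function to the last HW State it reported."""
    states = {}
    for line in lines:
        m = HW_STATE_RE.search(line)
        if m:
            states[m.group("pci")] = m.group("state")
    return states


# Hypothetical sequence built from the message formats in the table above.
SAMPLE_LINES = [
    "qla2xxx 0000:0d:00.7: HW State: NEED RESET",
    "qla2xxx 0000:0d:00.7: qla82xx_abort_isp(4): reset_owner is 0x7",
    "qla2xxx 0000:0d:00.7: HW State: INITIALIZING",
    "qla2xxx 0000:0d:00.7: HW State: READY",
]
```

A final state of READY after NEED RESET and INITIALIZING suggests the reset completed; a final state of FAILED means the board has been disabled.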