iSCSI: Past, Present, Future
Robert Russell
Computer Science Department and IOL
University of New Hampshire
© UNIVERSITY of NEW HAMPSHIRE INTEROPERABILITY LABORATORY
Very Brief History
• Late 1997 – idea of storage over IP – Julian Satran, IBM Research
• Late 1999 – IBM and Cisco start joint work on proposal for standard
• Early 2000 – IETF creates IP storage working group
• November 2000 – IETF draft 0 posted
Rest of Brief History
• Jan 2001 – SNIA creates IP storage forum
• July 2001 – first UNH IOL iSCSI Plugfest
  – 28 companies attended
  – Tested drafts 0 and 6
• Feb 2003 – IETF approves draft 20
• June 2003 – Microsoft Server with iSCSI
• April 2004 – IETF publishes RFC 3720
Today – December 2005
• iSCSI products offered by all storage and platform vendors
• Many small vendors in the market
• iSCSI now well accepted at low and middle performance ranges
• 1 Gig wire-speed HBAs available
• 10 Gig iSCSI products starting to appear
Other SAN Technologies
• Enterprise data centers still based on Fibre Channel (1 Gig, 2 Gig, soon 4 Gig)
• Renewed interest in iFCP and FCIP
• Will Fibre Channel equipment prices be lower in the near future?
• InfiniBand
SCSI Transport Protocol for TCP
• Based on widely used, off-the-shelf technology
  – SCSI, TCP, IPsec, Ethernet
  – Familiar, already installed infrastructure
  – Commodity components, inexpensive
• Permits all-software implementations
  – Encourages experimentation, early feedback
  – Many freely distributed implementations
iSCSI Design Principles
• Target controls data transfer
  – To enable fair sharing of resources
  – To manage limited memory resources
  – To improve disk performance
• Messages (PDUs) in both directions are sequenced and acknowledged
  – In addition to TCP sequencing and acknowledging
  – To maintain SCSI command ordering
  – To detect errors
  – To control data flow
iSCSI: Text Negotiation
• Stylistic departure from FC, TCP, etc.
  – Key=value
• Used for multiple purposes: Login, authentication, discovery, renegotiation
• Easy to use, understand, debug
• Slower to process, bigger messages
  – Used mostly in Login – sessions are long-lived
  – Linux initiator now split between kernel/user
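The key=value style can be illustrated with a minimal sketch. It assumes the RFC 3720 convention that each pair is a text string terminated by a NUL byte; the function names `encode_text_keys` and `decode_text_keys` are hypothetical, and the keys shown are only examples of an offer.

```python
def encode_text_keys(pairs):
    """Encode a dict as key=value strings, each terminated by a NUL byte."""
    return b"".join(f"{key}={value}".encode("utf-8") + b"\x00"
                    for key, value in pairs.items())


def decode_text_keys(data):
    """Decode NUL-separated key=value strings back into a dict."""
    result = {}
    for item in data.split(b"\x00"):
        if not item:
            continue  # skip the trailing terminator and any zero padding
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            raise ValueError(f"missing '=' in text item: {item!r}")
        result[key] = value
    return result


# Example offer; the values here are illustrative, not recommended settings.
offer = {
    "HeaderDigest": "CRC32C,None",
    "DataDigest": "None",
    "MaxRecvDataSegmentLength": "8192",
}
wire = encode_text_keys(offer)
print(wire)
print(decode_text_keys(wire))
```

The readable wire format is what makes negotiation easy to debug, and the string handling is what makes it slower than fixed binary fields.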
iSCSI: Designed-in Extensibility
• Text keys and values
  – Can carry info in both directions - slow
• Additional header segments (AHS)
  – Can carry info from initiator to target - fast
• Asynchronous messages
  – Can carry info from target to initiator - fast
iSCSI: Error Handling
• End-to-end CRCs (digests) useful because TCP has weak checksum
  – Stone and Partridge, ACM SIGCOMM 2000, pp 309-319
• TCP checksum observed to catch an error in 1 of every 1100 to 1 of every 32000 segments
• An error gets through the TCP checksum to the application in 1 of every 6 million to 1 of every 10 billion segments
• Markers embedded in stream – little used
• 3 levels of recovery to deal with CRC errors and connection loss
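A small sketch can show why a CRC digest adds protection over the TCP checksum. It assumes the digest is CRC32C (the Castagnoli polynomial used by RFC 3720) and uses a slow bitwise implementation for clarity; `internet_checksum` and `crc32c` are hypothetical helper names. Swapping two 16-bit words leaves the ones-complement checksum unchanged, while the CRC changes.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones-complement checksum of the kind used by TCP."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


good = b"\x12\x34\x56\x78" + b"payload!"
bad = b"\x56\x78\x12\x34" + b"payload!"  # first two 16-bit words swapped

print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard check value
print(internet_checksum(good) == internet_checksum(bad))  # True: swap missed
print(crc32c(good) == crc32c(bad))  # False: swap detected
```

A production implementation would use a table-driven or hardware-assisted CRC; the bitwise loop is only meant to make the algorithm visible.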
iSCSI: Error Recovery
• Complex, many choices of action to take
• Poorly tested, may hide bugs
• Why so complex?
  – SCSI error recovery slow, crude
  – Some applications require absolute accuracy
  – Compromise after long discussion
• Philosophy repudiated by iSER/iWARP
Draft Implementer’s Guide
• 3 clarifications – no change to existing code
  – Over/underflow, reserved ITT, format errors
• 2 corrections – minor changes to existing code
  – Interaction between R2Ts on same connection
  – Handling data digest errors on Reject, Async messages
Draft Implementer’s Guide (continued)
• 2 new additions – minor changes to existing code
  – Task management affecting multiple I_T Nexi
• New proposal now under discussion
  – Reinstating unnamed discovery sessions
    • To avoid interference with normal sessions
    • To permit independent discovery sessions based on target addresses
Relatively Unused Features
• Error recovery level 2
• Out of order PDUs and/or PDU sequences
• Multiple connections (scheduling policies?)
• Use with IPsec (management)
• Bidirectional commands (only 1 in SBC-2)
• Additional header segments (AHS)
• Markers
iSCSI: RFC 3720 Document
• Long, informal English prose
• Ambiguous, can be misinterpreted
• Testing is long, has many combinations
• Need for use of formal methods for specification, verification, testing
  – Bishop et al., ACM SIGCOMM 2005, pp 265-276, Rigorous Specification for TCP and UDP
iSCSI: Performance Factors
• Workload characteristics
  – Sequential streaming vs random access
  – Read/write, large/small transfers
• Network characteristics
  – Speed (100, 10000 Mbps)
  – Distance (LAN, MAN, WAN)
  – Error rates
  – Congestion
iSCSI: Performance Metrics
• Bandwidth utilization – high is desirable
• CPU utilization – low is desirable
• Latency – low is desirable
• Transaction rate – high is desirable
iSCSI: Performance
• Numerous studies done, many more to do
• Many, many tunable parameters at all levels
  – SCSI
  – iSCSI
  – TCP
  – Ethernet
• Interactions/tradeoffs within/between levels
• Dynamic parameter adjustment
SCSI Initiator Parameters
• Maximum no. of outstanding commands
  – Big enough to keep network pipeline full
• Maximum no. of sectors per command
  – Big to allow multi-sector requests
• Maximum no. of I/O vectors per command
  – Big to allow scatter/gather operations
• Coalescing contiguous blocks
  – In order to reduce need for I/O vectors
iSCSI: Tunable Parameters
• PDU size – declared on initiator and target
  – Usage determined independently by sender
  – Big enough to keep pipeline full
• Out-of-order PDUs – negotiate on/off
  – Usage determined independently by sender
  – May be useful when target sends DataIn PDUs
  – May be bad when initiator sends DataOut PDUs
iSCSI: Tunable Parameters
• Header/data digests – negotiate on/off
  – Used by both sides
  – Catches errors that get through TCP checksum
• Error recovery level – negotiate 0, 1 or 2
  – Used by both sides
  – Higher levels give faster, smoother recovery
• Markers – negotiate on/off and interval
  – Used by each side independently
  – Recovers PDU alignment in TCP stream
iSCSI: Tunable Parameters
• Immediate/unsolicited data – negotiate on/off and maximum
  – Usage determined by initiator on writes only
  – May reduce latency on small writes
  – May increase buffering on target – extra copy
• Multiple connections – negotiate maximum
  – Creation and usage determined by initiator
  – Scheduling algorithms not yet explored
iSCSI: Tunable Parameters
• Burst sizes – negotiate max
  – Usage determined independently by target
  – Big enough to keep pipeline full
• Number outstanding R2Ts – negotiate max
  – Usage determined independently by target
  – Big enough to keep pipeline full
iSCSI: Tunable Parameters
• Phase collapse – internal to target
  – Eliminates extra response PDU from target
• Command window – internal to target
  – Controls load and buffer usage on target
  – Big enough to keep pipeline full
iSCSI: Tunable Parameters
• A-bit, DataAck SNACK – negotiate ERL > 0
  – Usage determined independently by target on reads only
  – Reduces buffering on target
• Out-of-order Sequences – negotiate on/off
  – Usage determined independently by target
  – Reduces latency and buffering on target
TCP: Tunable Parameters
• Maximum window sizes
  – Bigger generally better
• Options for timestamps, window scaling, etc.
• Delayed, selective acknowledgements
• Nagle algorithm – to coalesce small packets
  – Turn off except when streaming small PDUs
• Dynamic packet coalescing
  – Better control than Nagle on/off
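Two of these TCP settings can be reached from user space through standard socket options. The sketch below uses Python's `socket` module; the function name `make_tuned_socket` and the 1 MB buffer size are assumptions for illustration, and the kernel is free to clamp or adjust the requested buffer sizes.

```python
import socket


def make_tuned_socket(buffer_bytes=1 << 20, disable_nagle=True):
    """Create a TCP socket with Nagle disabled and larger buffers requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if disable_nagle:
        # TCP_NODELAY turns off the Nagle algorithm for this socket.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Socket buffer sizes bound the window the stack can advertise or fill.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    return sock


sock = make_tuned_socket()
print("Nagle disabled:",
      sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
print("Receive buffer granted:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```

Options such as timestamps, window scaling, and selective acknowledgement are usually system-wide stack settings rather than per-socket options, so they are not shown here.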
Ethernet: Tunable Parameters
• Jumbo frames
  – Improves bandwidth utilization
  – Decreases CPU overhead
  – Not supported on all NICs, HBAs, switches
• Driver DMA input queue length
  – Bigger to smooth out traffic bursts
• Interrupt coalescing
  – Trades response time against CPU overhead
Tradeoff Example: iSCSI CRC
• Use of TOE without iSCSI CRC off-loaded
  – Reduces performance due to memory access
• Use of TOE with iSCSI CRC off-loaded
  – Reduces protection due to bus crossing
• Use of TCP copy and iSCSI CRC in software
  – Expensive, but performance better for small PDU
iSCSI CRC in Software
• 2% reduction for PDUs less than 2 KB
• 31% reduction for PDUs bigger than 8 KB
Graph of Parameter Interaction (figure not reproduced)
iSCSI: Parameter Relationship
• Let N = number of outstanding R2Ts
• Let M = MaxBurstLength (MRDSL) in KB
• Then at top of the “knee” in the graph, N x M = 64
• The “pipeline size” at this latency
• Target controls N to keep pipeline full
• Formula needs additional factor for latency
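The relationship N x M = 64 can be turned into a tiny calculation. This is a sketch under the slide's stated condition (a 64 KB pipeline at the measured latency); the function name `r2ts_to_fill_pipeline` is hypothetical, and the latency factor the slide says is still needed is not modeled.

```python
import math


def r2ts_to_fill_pipeline(max_burst_kb, pipeline_kb=64):
    """Smallest N such that N * M covers the pipeline size (both in KB)."""
    if max_burst_kb <= 0:
        raise ValueError("max_burst_kb must be positive")
    return math.ceil(pipeline_kb / max_burst_kb)


for burst_kb in (8, 16, 64):
    n = r2ts_to_fill_pipeline(burst_kb)
    print(f"M = {burst_kb:2d} KB -> N = {n} outstanding R2Ts")
```

With M = 8 KB the target would need 8 outstanding R2Ts, with M = 16 KB it would need 4, and with M = 64 KB a single R2T suffices.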
Equation for Write Throughput
t = A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F
t = Time in msec to transmit one MB
x1 = Number of R2Ts / command
x2 = Number of data-outs / command
x3 = Immediate data bytes (MB) / command
x4 = Unsolicited data bytes (MB) / command
x5 = Solicited data bytes (MB) / command
Calculated Coefficients
A = 1.82 msec / R2T PDU
B = 0.011 msec / DataOut PDU
C = 115.29 msec / immediate MB
D = 120.79 msec / unsolicited MB
E = 87.72 msec / solicited MB
F = 0 msec
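The linear model and its coefficients can be evaluated directly. The sketch below plugs the coefficients above into the write-time equation; the function name `write_time_ms` and the example workloads are assumptions, and the inputs are treated as totals for the one-megabyte transfer being timed.

```python
# Coefficients from the slide (msec per unit).
A = 1.82     # per R2T PDU
B = 0.011    # per DataOut PDU
C = 115.29   # per immediate MB
D = 120.79   # per unsolicited MB
E = 87.72    # per solicited MB
F = 0.0      # constant term


def write_time_ms(r2ts, data_outs, immediate_mb, unsolicited_mb, solicited_mb):
    """Evaluate t = A*x1 + B*x2 + C*x3 + D*x4 + E*x5 + F in milliseconds."""
    return (A * r2ts + B * data_outs + C * immediate_mb
            + D * unsolicited_mb + E * solicited_mb + F)


# Hypothetical workloads: 1 MB of solicited data sent as 128 DataOut PDUs,
# requested with either one R2T or sixteen R2Ts.
one_r2t = write_time_ms(1, 128, 0.0, 0.0, 1.0)
sixteen_r2ts = write_time_ms(16, 128, 0.0, 0.0, 1.0)
print(f"1 R2T:   {one_r2t:.3f} msec")
print(f"16 R2Ts: {sixteen_r2ts:.3f} msec")
```

In this model each extra R2T costs far more than each extra DataOut PDU, which is consistent with the advice to keep bursts large enough that few R2Ts are needed.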
Write From UNH Initiator (graph of calculated time vs. observed time, not reproduced)
Write From Windows Initiator (graph not reproduced)
Memory a Critical iSCSI Resource
• Initiator paging to an iSCSI disk
  – VM system MUST NOT block for memory
  – Without care, standard TCP stack will block for memory (buffers and control structures)
  – Without care, iSCSI data path will block for memory
• Target memory starvation
  – May get multiple commands at once
  – Must hold memory until receipt acknowledged
  – Acknowledgement may be delayed indefinitely
  – Target must send NopIn or set A-bit on last DataIn
iSCSI: CPU Load
• CPU utilization is not negligible
  – Biggest percent from TCP/IP, not iSCSI or SCSI
  – Standard TOE off-loading helps output
  – iSCSI HBA off-loading helps input and output
  – Software iSCSI CRC is expensive for large PDUs
CPU Overhead without HBA
• Interrupt rate
  – 1500 byte frame every 12 microsecs on 1 GE
  – 9000 byte frame every 5 microsecs on 10 GE
• Frequent cache flushing
• Extra copying
  – TOEs help mainly on output
  – Input requires intermediate TCP buffers or costly memory mapping
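The interrupt-rate figures come from frame serialization time, which can be checked with simple arithmetic. The sketch below assumes one interrupt per frame and ignores preamble, inter-frame gap, and header overhead; the function names are hypothetical.

```python
def frame_time_us(frame_bytes, link_bits_per_sec):
    """Time in microseconds to put one frame on the wire."""
    return frame_bytes * 8 * 1e6 / link_bits_per_sec


def frames_per_second(frame_bytes, link_bits_per_sec):
    """Back-to-back frame arrival rate, one potential interrupt per frame."""
    return link_bits_per_sec / (frame_bytes * 8)


for label, size, rate in (("1 GE, 1500-byte frames", 1500, 1e9),
                          ("10 GE, 9000-byte frames", 9000, 1e10)):
    print(f"{label}: {frame_time_us(size, rate):.1f} microsecs per frame, "
          f"{frames_per_second(size, rate):,.0f} frames/sec")
```

Under these assumptions the 1 GE case gives 12 microseconds per frame, matching the slide, while the 10 GE jumbo-frame case comes out near 7.2 microseconds, somewhat above the slide's figure of 5. Either way the host sees on the order of a hundred thousand frames per second unless interrupts are coalesced or off-loaded.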
iWARP
• IETF Remote Direct Data Placement WG
• Suite of protocols
  – RDMAP – to control DDP coherently
  – DDP – to segment and place data directly
  – MPA – to align frames in TCP stream
  – SCTP – to bypass MPA/TCP, map DDP onto IP
• Implemented in RNIC – RDMA-aware NIC
• For general use, not just iSCSI/iSER
iSCSI: Stack With iSER/iWARP (protocol stack diagram with RNIC, not reproduced)
iWARP: RNIC Concepts
• Manages large transfers without host CPU interaction
• Fragments large transfers into TCP segments, each with extra headers
• Avoids copying at both ends of the wire
• Adds end-to-end CRC checking
• Adds markers to handle out-of-order frames
iWARP: RNIC Benefits
• Substantially reduces host overhead
  – Fewer host interrupts
    • Once per transfer, not once per frame
  – Fewer host cache flushes
  – Less use of host memory space
    • No network buffers in host memory
  – Less use of host memory bus
    • One direct transfer between wire and memory
• Better use of network bandwidth
• Lower network latency
iWARP Details
• Untagged buffers for control frames
  – 20-byte header plus 4-byte CRC
• Tagged buffers for data frames
  – 16-byte header plus 4-byte CRC
• Uses IPsec for transmission security
• Many other security requirements for RNIC
• Error handling philosophy – terminate the connection!
iSER: iSCSI Extensions for RDMA
• Interface between iSCSI and RDMA
• iSER adds 12-byte header to control PDUs
• Makes iSCSI independent of any protocol
  – RDMAP/DDP/MPA/TCP/IP
  – RDMAP/DDP/SCTP/IP
  – InfiniBand
  – Others? (Myrinet? Quadrics?)
iSER: Concepts
• Target controls data flow
  – iSCSI read = target RDMA write
  – iSCSI write = target RDMA read
• 4 new keys
• Old keys for digests, markers are irrelevant
• Handling of iSCSI PDUs
  – R2T, DataOut PDUs replaced by RDMA read
  – DataIn PDUs replaced by RDMA write
  – All other PDUs carried by RDMA send
iSCSI/iSER/iWARP Error Handling
• Guaranteed reliable, in-order delivery
• iSER/iWARP error terminates connection!
• All iSCSI error recovery levels possible
• Level 1 reduced to almost nothing
  – Digest and sequence errors now impossible
  – PDU retransmission timeouts discouraged
  – SNACK must no longer be sent
iSCSI: Sharing a Target Device
✓ Multiple hosts easily access common target
✓ Efficient block transport directly to disk
✗ No notion of files, directories, data, or metadata
✗ No contention detection or resolution
✗ No allocation or management of blocks
Object-based Storage System
• Idea: push more intelligence onto disk unit
  – Target manages block allocation
  – Target defines objects and maps their blocks
  – Target manages object metadata
• Enhancements to SCSI command set
• Must rewrite file systems to use objects
ANSI Project T10/1355-D
• SCSI Object-Based Storage Device Commands
• Final Revision 10, 30 July 2004
“To provide efficient operation of I/O logical units that manage the allocation, placement and accessing of variable-size data-storage containers called objects.”
iSCSI: Object-store
• All OSD commands are bi-directional
• 200-byte CDB requires use of iSCSI AHS
  – Reading a PDU header requires 2 steps:
    • Read 48-byte Basic header, extract AHSLength
    • Read following AHSLength bytes
  – Header digest (CRC) is problematic
    • Use AHSLength value to read AHS headers and CRC
    • Use CRC to check complete header after read done
    • If AHSLength has error – input may block!
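The two-step header read, and the way a bad length can stall it, can be sketched as follows. The sketch assumes the RFC 3720 layout in which byte 4 of the 48-byte basic header holds TotalAHSLength in units of four-byte words, and a 4-byte header digest follows the AHS when negotiated; `read_exact` and `read_pdu_header` are hypothetical names, and an in-memory stream stands in for a TCP connection.

```python
import io

BHS_LENGTH = 48          # basic header segment size in bytes
HEADER_DIGEST_LENGTH = 4


def read_exact(stream, count):
    """Read exactly count bytes, or raise EOFError if the stream runs dry."""
    chunks = []
    remaining = count
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended before the full header arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_pdu_header(stream, header_digest=False):
    """Step 1: read the basic header. Step 2: read the AHS it announces."""
    bhs = read_exact(stream, BHS_LENGTH)
    ahs_length = bhs[4] * 4  # TotalAHSLength is counted in 4-byte words
    ahs = read_exact(stream, ahs_length)
    digest = read_exact(stream, HEADER_DIGEST_LENGTH) if header_digest else b""
    return bhs, ahs, digest


# Build a header that announces 2 words (8 bytes) of AHS plus a digest field.
bhs = bytearray(BHS_LENGTH)
bhs[4] = 2
stream = io.BytesIO(bytes(bhs) + b"\xAA" * 8 + b"\x01\x02\x03\x04")
header, ahs, digest = read_pdu_header(stream, header_digest=True)
print(len(header), len(ahs), digest.hex())

# A corrupted length field asks for more bytes than were sent. On a live
# connection the read would wait; here the in-memory stream raises instead.
bhs[4] = 200
try:
    read_pdu_header(io.BytesIO(bytes(bhs) + b"\xAA" * 8), header_digest=True)
except EOFError as exc:
    print("blocked read:", exc)
```

The digest cannot protect the length field at the moment it is needed, because the receiver must trust AHSLength to locate the digest in the first place.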
Tracking SCSI Standards
• iSCSI version 0 (RFC 3720) based on:
  – SAM-2 Final revision 24, 11-September-2002
  – SBC Final revision 8, 13-November-1997
• Upgrade to SAM-3 project T10/1561-D?
  – Final revision 14, 21-September-2004
• SAM-4 project T10/1683-D
  – Current draft 3, 20-September-2005
• Upgrade to SBC-2 project T10/1417-D?
  – Final revision 16, 13-November-2004
SCSI: SAM-3 Changes
• Task management command changes
• Async event notification removed
• Contingent allegiance removed
• Untagged tasks removed
• Task priority added
iSCSI: Research Areas
• Dynamic parameter adjustment
  – Respond to changes in application load
  – Respond to changes in network conditions
• Using parameters between levels
  – E.g., let iSCSI use TCP’s RTT to keep pipe full
• Negotiate limits but operate at other values
  – Target controls burst sizes, outstanding R2Ts
  – Initiator controls connections, unsolicited data
iSCSI: Research Areas
• Scheduling multiple connections in session
  – What criteria to use?
  – Let different connections carry different info?
• Reinstating sessions in order to renegotiate
  – Only limited differences between connections
• New file systems
• New caching schemes
iSCSI: Novel Uses
• Take advantage of extensibility features
• Use AHS to carry extra information with commands from Initiator to Target
• Use Async messages to carry extra information from Target to Initiator
• Use new text keys to exchange metadata
• Use multiple connections to carry different information
Diagram (not reproduced): Initiator and Target stacks joined by an iSCSI session. Labels: Applications, New File System, Disk, SCSI, iSCSI, TCP/IP, Link