Transient Analysis CK Cheng UC San Diego Jan. 25, 2007

Outline
• Research Directions
• Simulation test case results
• Overview of Simulation
• Commercial Package
• Alternating direction implicit (ADI) Method
• General Operator Splitting Method
• Distributed Computing
• Conclusions and Future Works

Research Directions
• Simulation: SPICE, STA
• Network on Chip: topology and wire styles, Power, and Clock Networks
• Data Path Components: adders, shifters, multipliers, division
• Packaging: passive distortion compensation

6x6 Bump Simulation Results
• The Circuit:
  – 184K Capacitors, 17K Current Sources, 120K Inductors and 246K Resistors.
  – 306K Nodes
• Accuracy:
  – Waveform and measurement results match Fujitsu’s with less than 0.002% error.
• Runtime / Memory Comparison:

                      CPU Time   Memory    Computer Used
  UCSD                678 s      600.2M    Pentium 4 3.2G, Linux
  Fujitsu Log File    1845 s     771M      unknown

6x6 Bump Simulation Results
• Measurement results and waveform

                      Min_pwr_l_est_10000954   Min_18269323   Min_33085875
  UCSD                0.9980790                0.9967357      0.9934251
  Fujitsu Log File    0.9980620                0.9966940      0.9933790
  Error               0.002%                   0.004%         0.005%

(Red curve is UCSD result)
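
The Error row can be reproduced from the two result rows. The sketch below assumes the error is the relative difference taken against the Fujitsu value; the slide does not state the definition, and the helper name relative_error_percent is made up for illustration.

```python
# Relative error between two sets of measurement values, in percent.
# Values are copied from the 6x6 bump measurement table.
ucsd = {
    "Min_pwr_l_est_10000954": 0.9980790,
    "Min_18269323": 0.9967357,
    "Min_33085875": 0.9934251,
}
fujitsu = {
    "Min_pwr_l_est_10000954": 0.9980620,
    "Min_18269323": 0.9966940,
    "Min_33085875": 0.9933790,
}


def relative_error_percent(value, reference):
    """Return |value - reference| / |reference| as a percentage (assumed definition)."""
    return abs(value - reference) / abs(reference) * 100.0


errors = {name: relative_error_percent(ucsd[name], fujitsu[name]) for name in ucsd}
for name, err in errors.items():
    print(f"{name}: {err:.3f}%")
```

Rounded to three decimals, the three values agree with the Error row of the table.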

703KR Simulation Results
• The Circuit:
  – 514K Capacitors, 76K Current Sources, 370K Inductors and 703K Resistors.
  – 1.3M Nodes
• Accuracy:
  – Measurement results match Fujitsu’s with less than 0.02% error.
• Runtime / Memory Comparison:

                      CPU Time            Memory   Computer Used
  UCSD                2575 s (0.7 h)      1.7G     Pentium 4 3.2G, Linux
  Fujitsu Log File    864561 s (240 h)    2.28G    unknown

703KR Simulation Results
• Measurement results and waveform

                      Min_33096003   Min_33096004   Min_33097557
  UCSD                0.9400988      0.9421157      0.9370827
  Fujitsu Log File    0.9399610      0.9419260      0.9368400
  Error               0.015%                        0.026%

(UCSD results only. Fujitsu waveform is not available for comparison)

Further Speed-ups
• Reduce iteration count by 50% for pure linear circuits (like 6x6 bump and 703KR)
  – 2x speed-up
• More effective time step control
  – DVDT, breakpoint, truncation error
  – 1.5x-3x speed-up
• Use Multigrid solver
  – 1.5x-2x speed-up for medium circuits (6x6 bump)
  – 2x-10x speed-up for large circuits (703KR)
• Parallel simulation
  – 4 or more processors on Linux cluster
  – 32 to hundreds of processors on supercomputer
• Overall speed-up
  – 6x-60x speed-up without parallel simulation
  – 12x-1000x speed-up with parallel simulation
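
One way to read the overall figures is as a product of the individual factors. This assumes the factors combine multiplicatively and that the large-circuit multigrid range is the one used; the slide does not say how the totals were obtained.

```latex
\begin{align*}
\text{lower bound:}\quad & 2 \times 1.5 \times 2 = 6\\
\text{upper bound:}\quad & 2 \times 3 \times 10 = 60
\end{align*}
```

Under the same reading, the parallel figures correspond to an additional factor between 2 (12/6) and roughly 17 (1000/60).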

Performance and capacity prediction

  Size             Nodes      Preferred Solver       CPU Time       Memory
  Small - Medium   0.3M       LU Decomposition       11 minutes     600M
  Medium - Large   1.3M       Multigrid              43 minutes     1.7G
  Huge             10-100M    Multigrid + Parallel   5-100 hours    15G-200G

Huge: cases 10x-100x larger than 703KR.

Overview of Simulation
Simulation flow: Load Circuit → Device Evaluation → Integration Approximation → Linearization → LU Decomposition → N-R Converge? (No: iterate again; Yes: Time Step Control → Next Time Point)
Our research:
• Fast speed with SPICE accuracy
• Nonlinear devices
• Efficient matrix solvers
• Effective integration methods
• Time step controls according to different integration methods
• Distributed computing
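
The flow above can be sketched on a one-node circuit. This is a minimal illustration, not the simulator described in the slides: the circuit (a source through a resistor into a capacitor and a diode), all component values, and the function names are assumptions, the time step is fixed (no time step control), and the "matrix solve" collapses to a scalar division.

```python
import math

# Hypothetical single-node test circuit (not from the slides):
# source VS through resistor R into a node with capacitor C and a diode to ground.
VS, R, C = 1.0, 1e3, 1e-9        # volts, ohms, farads
IS, VT = 1e-14, 0.02585          # diode saturation current and thermal voltage


def diode(v):
    """Device evaluation: diode current and its derivative (conductance)."""
    e = math.exp(v / VT)
    return IS * (e - 1.0), IS * e / VT


def transient(t_stop, h, tol=1e-12, max_iter=50):
    """Backward-Euler transient analysis with Newton-Raphson at each time point."""
    t, v = 0.0, 0.0
    waveform = [(t, v)]
    steps = int(round(t_stop / h))
    for _ in range(steps):
        v_prev, v_new = v, v                  # initial guess: previous solution
        for _ in range(max_iter):
            i_d, g_d = diode(v_new)           # device evaluation
            # Integration approximation: C * dv/dt ~ C * (v_new - v_prev) / h
            residual = C * (v_new - v_prev) / h - (VS - v_new) / R + i_d
            jacobian = C / h + 1.0 / R + g_d  # linearization (a 1x1 "matrix")
            dv = -residual / jacobian         # linear solve (LU in the general case)
            v_new += dv
            if abs(dv) < tol:                 # N-R converged?
                break
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        t, v = t + h, v_new                   # accept and move to the next time point
        waveform.append((t, v))
    return waveform


wave = transient(t_stop=1e-5, h=1e-7)
print(f"final node voltage: {wave[-1][1]:.4f} V")
```

In a full simulator the scalar Jacobian becomes a sparse matrix, and the step size h is adjusted after each accepted point.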

Overview of Simulation
• Matrix Solver
  – LU Decomposition
  – Iterative Approach
• Integration
  – Time Step Control
  – ADI
• Nonlinear Devices
  – Two Stage Newton Raphson
• Distributed Computing
• Commercial Implementation

Overview of Simulation
• Integration
  – Time Step Control
  – ADI (two-way partitioning)
  – Operator Splitting (multi-way)
• Distributed Computing
  – MPI
  – Partitioning
• Three Ph.D. Students

Commercial Package: Fastrack Design
• Founded in January 2001
• Headquartered in San Jose
• Privately funded, cash-flow positive
• Two Business Units
  – Design Services
  – Technology Products

Analog Designs

  Design            # Elements   Sim. Len   HSpice        mSPICE    Speedup Factor
                    13490        20 us      80 h          26 h      3.1X
                    222          1 ms       13,706 s      2,670 s   5.1X
  Biasing Circuit   49197        200 ns     427 s         82 s      5.2X
  PLL               16050        40 us      67 d          12 d      5.6X
  PLL               300K         40 us      290 d (est)   16 d      18.1X

Design names are listed in slide-text order; "LVDS Oscillator (post-layout)" appears after the last row without a clear row assignment.

Digital Blocks

                  Devices                       Runtime
  Design Name     MOS      R        C           mSPICE    Traditional Spice   Speedup Factor
  ALU             10.1k    12.7k    7.5k        6.9 m     7 m                 1.0X
                  69k      83.7k    52.5k       1.5 h     9.5 h               6.3X
  YN_BLK          205K     242.8k   203.9k      3.5 h     > 2 d               >13.7X
  THP             437k     499.3k   313.5k      5.0 h     COULD NOT RUN       ∞
  VCON            936k     753k     561k        15.0 h    COULD NOT RUN       ∞

Design names are listed in slide-text order; "CONTROL" appears after the last row without a clear row assignment.

Memory Blocks

  Design               # Tr    # R     # C     # Vectors / Sim. Length   mSPICE Run Time
  BRAM (pre)           220K    0       500     2                         2.5 hours
  SRAM (pre) 8Kx8 SP   410K    0       0       2                         7 hours
  eRAM (post) 256x16   72K     28K     427K    48 ns                     8 hours
  BRAM (post)          220K    1320K   870K    2                         18 hours

• 100% accurate Spice simulation

mSPICE-Parallel
• Industry’s first practical parallel Spice simulation solution
  – Increases capacity further
  – Dramatically improves throughput
• Uses Matrix Level Partitioning
  – No loss of accuracy
  – Client-Server configuration
  – Minimal memory requirement for client nodes

Client-Server Configuration
[Figure: client-server configuration]
• Server distributes sub-matrices to clients
• Clients communicate partial solutions
• Minimal memory requirements for clients

Experimental Results

                                                             Runtime
  Design                   Total Elements   Sim. Length   1-proc   2-proc            4-proc
  ASIC                     1.2M             8 ns          12.2 h   7.0 h (1.7X)      5.1 h (2.4X)
  38 IO SSO                1.4M             30 ns         3.0 h    2.0 h (1.5X)      1.4 h (2.2X)
  Signal-power             2.1M             1.2 us        13 d     7 d 18 h (1.7X)   5 d 12 h (2.4X)
  4096x8 RAM (extracted)   2.3M             10 ns         32 h     18.5 h (1.7X)     13.4 h (2.4X)
  120 IO SSO               3.5M             30 ns         6.2 h    4.1 h (1.5X)      3.1 h (2.0X)

ADI: Previous Works
• 1999, Namiki and Ito
  – The alternating direction implicit (ADI) method is used to simulate a 2D TE wave.
• 2001, Zheng et al.
  – Extended to 3D problems
• 2001 & 2003, Lee and Chen
  – ADI is applied to a power grid modeled as transmission lines
In these works the alternation is among different geometric directions, so the simulated geometric structure is constrained.

Alternating Direction Implicit (ADI)
• ADI Integration Method
  – Two-way partition of the circuit
  – One partition is used for each backward integration
  – Unconditionally stable (A-stable: independent of time step size)
  – Time step size is chosen according to local truncation error

Alternating Direction Implicit (ADI)
• ADI method formulation
• Circuit partition algorithm
• Local truncation error estimation
• Stability discussion
• Experimental results

SPICE Formulation
• Equations for RLC circuits, where
  C: capacitance matrix
  L: inductance matrix
  R: resistance matrix
  G: conductance matrix
  E: incidence matrix
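
The equations themselves are not part of this transcript. One common way to write the RLC equations with the matrices listed above is shown below as an assumption about the form and sign conventions; the symbols v (node voltages), i (inductor branch currents) and u (current sources) are notation introduced here, not taken from the slide.

```latex
\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix}
\frac{d}{dt}\begin{bmatrix} v \\ i \end{bmatrix}
= -\begin{bmatrix} G & E \\ -E^{T} & R \end{bmatrix}
\begin{bmatrix} v \\ i \end{bmatrix}
+ \begin{bmatrix} u \\ 0 \end{bmatrix}
```

The first block row is the current balance at each node, and the second is the voltage drop across each series R-L branch.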

ADI Formulation
• Transient simulation
  – Split the resistor and inductor branches into two parts
    • G = G1 + G2
    • E = E1 + E2
    • R = R1 + R2
  – Alternate backward and forward integration on each partition
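
The alternation can be sketched on a small linear system x' = -(A1 + A2) x. This is an illustration in the Peaceman-Rachford form of ADI, assumed here because the slide's own equations are not in the transcript; the 2x2 matrices, the helper names, and the closed-form reference solution are all chosen for the example.

```python
import math

# Hypothetical 2-node linear network x' = -(A1 + A2) x (values chosen for illustration).
A1 = [[1.5, -1.0], [-1.0, 1.0]]   # partition 1
A2 = [[0.5, 0.0], [0.0, 1.0]]     # partition 2; A1 + A2 = [[2, -1], [-1, 2]]


def matvec(m, x):
    return [m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1]]


def solve2(m, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - b[0] * m[1][0]) / det]


def i_plus(a, s):
    """Return I + s * a for a 2x2 matrix a."""
    return [[1.0 + s * a[0][0], s * a[0][1]], [s * a[1][0], 1.0 + s * a[1][1]]]


def adi_step(x, h):
    # First half step: backward (implicit) on A1, forward (explicit) on A2.
    half = solve2(i_plus(A1, h / 2), matvec(i_plus(A2, -h / 2), x))
    # Second half step: the roles are swapped.
    return solve2(i_plus(A2, h / 2), matvec(i_plus(A1, -h / 2), half))


def simulate(x0, h, steps):
    x = list(x0)
    for _ in range(steps):
        x = adi_step(x, h)
    return x


def exact(t):
    # Closed form for x(0) = [1, 0]; the eigenvalues of A1 + A2 are 1 and 3.
    a, b = 0.5 * math.exp(-t), 0.5 * math.exp(-3.0 * t)
    return [a + b, a - b]


x_adi = simulate([1.0, 0.0], h=0.01, steps=100)
x_ref = exact(1.0)
print("ADI:  ", x_adi)
print("exact:", x_ref)
```

Each half step only factors I + (h/2) A1 or I + (h/2) A2, never the full matrix, which is where the sparsity benefit comes from.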

ADI Formulation (Cont.)
• Equations of ADI method
  – The size of the left-hand-side matrix remains unchanged
  – The number of non-zero elements is decreased
  – Direct solving methods can be efficient

Experiments of non-zero fill-ins
• A small ASIC Design Spice matrix:
  – Dimension: 10,286
  – The number of non-zero elements: 46,655
  – The number of non-zero fill-ins: 90,960
• A large I/O Design Spice matrix:
  – Dimension: 615,436
  – The number of non-zero elements: 2,126,246

            Sub-matrix 1                                Sub-matrix 2
            # non-zero elements   # non-zero fill-ins   # non-zero elements   # non-zero fill-ins   Total # non-zero fill-ins
  Case 1    38,572                2,618                 42,020                10,040                12,658
  Case 2    1,176,208             12,421,534            950,038               14,772,068            27,193,602
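
The totals in the last column are the sums of the two sub-matrix fill-in counts. The ratio below additionally assumes that Case 1 is the small ASIC design after partitioning, which the slide text does not state explicitly (for Case 2, the two sub-matrix non-zero counts do add up to the large design's 2,126,246).

```latex
\begin{align*}
\text{Case 1:}\quad & 2{,}618 + 10{,}040 = 12{,}658,
  \qquad \frac{90{,}960}{12{,}658} \approx 7.2\\
\text{Case 2:}\quad & 12{,}421{,}534 + 14{,}772{,}068 = 27{,}193{,}602
\end{align*}
```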

Local Truncation Error (LTE)
• Time step control using LTE
  – In circuit transient analysis, the next time step can be estimated from the local truncation error at the present time point.
  – LTE is defined as the difference between the calculated solution and the exact solution.
  – To ensure consistency, the local truncation error should not exceed the error tolerance; the time step is estimated from this bound.
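
The slide's step-size formula is not in this transcript. A common textbook controller is sketched below as a stand-in: it assumes the LTE of an order-p method scales as h^(p+1), and the function names and the safety, max_growth and min_shrink parameters are illustrative choices, not values from the presentation.

```python
def next_time_step(h, lte, tol, order, safety=0.9, max_growth=2.0, min_shrink=0.1):
    """Propose the next step size from the current LTE estimate.

    Assumes the LTE of a method of the given order scales as h**(order + 1).
    safety, max_growth and min_shrink are illustrative tuning knobs.
    """
    if lte <= 0.0:
        return h * max_growth              # no measurable error: grow as fast as allowed
    factor = safety * (tol / lte) ** (1.0 / (order + 1))
    factor = max(min_shrink, min(max_growth, factor))
    return h * factor


def step_accepted(lte, tol):
    """A step is kept only if its LTE is within the tolerance."""
    return lte <= tol


# Example with a second-order method: the LTE is 8x too large, so the step shrinks.
print(next_time_step(h=1e-9, lte=8e-6, tol=1e-6, order=2))
```

With a second-order method, an LTE eight times over tolerance roughly halves the step before the safety factor is applied.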

Local Truncation Error (Cont.)
• LTE of ADI method
  (1) Equations
  (2) Estimate exact solution: we characterize the input as a simple ramp over the interval (tn, tn+1) and use the exact analytic solution with time step tn
  (3) Estimate ADI solution
  (4) LTE estimation
  (5) Time step control

Stability Discussion
• Stability is concerned with whether the accumulated error grows or decays as time evolves through a series of time steps.
• For one-step integration approximations, the error is multiplied by an amplification factor at each time step.
• If the final steady state error vector is smaller than the initial one, then the integration method is stable.
• ADI integration method:
  – It can be proved to be unconditionally stable
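
The idea of a per-step amplification factor can be illustrated on the scalar test problem x' = -(lam1 + lam2) x with positive lam1 and lam2. This is only a model problem, not the proof referred to above: the ADI update is assumed to be the alternating backward/forward form, and forward Euler is included purely as a contrast.

```python
def forward_euler_factor(lam, h):
    """Error amplification per step for x' = -lam * x with forward Euler."""
    return 1.0 - h * lam


def adi_factor(lam1, lam2, h):
    """Amplification per full ADI step for the scalar split lam = lam1 + lam2."""
    first_half = (1.0 - 0.5 * h * lam2) / (1.0 + 0.5 * h * lam1)
    second_half = (1.0 - 0.5 * h * lam1) / (1.0 + 0.5 * h * lam2)
    return first_half * second_half


lam1, lam2 = 2.0, 3.0
for h in (0.01, 1.0, 100.0):
    fe = forward_euler_factor(lam1 + lam2, h)
    adi = adi_factor(lam1, lam2, h)
    print(f"h={h:>6}: |forward Euler| = {abs(fe):.3f}, |ADI| = {abs(adi):.3f}")
```

For this scalar case the ADI factor stays below 1 in magnitude for every positive step size, while the forward Euler factor exceeds 1 once h is larger than 2 / (lam1 + lam2).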

Experimental Results

                             Circuit 1   Circuit 2   Circuit 3   1k-cell
  #Nodes                     10,000      40,000      90,000      10,200
  #Transistors               0           0           0           6,500
  Period                     10 ns
  SPICE3  CPU time (sec)     77.8        485.3       3,061.1     181.6
  SPICE3  #steps             115, 114, 193 (three values in the slide text for four columns)
  ADI     CPU time (sec)     28.6        117.8       275.2       523.3
  ADI     #steps             102, 102, 949 (three values in the slide text for four columns)
  Speedup                    2.7x        4.1x        11.1x       -

Voltage drop of Circuit 3 (power mesh with sinks)

Signal in 1 k_cell (ASIC design)

General Operator Splitting
• General operator splitting method
  – Multi-way partitions
  – Each partition is considered separately in each time step of the simulation
  – No geometric constraints
  – Local truncation error is used to dynamically control time step size

General Operator Splitting
• Fundamental theory
• Operator splitting formulation
• Local truncation error estimation
• Stability discussion
• Experimental results

Fundamental theory
• In circuit transient simulation, the integration approximation is actually an approximation of the exponential operator
• The exponential operators can be approximated to any order using a general scheme of fractal decomposition
• The decomposition of exponential operators corresponds to the multi-way partition of the circuit
→ New integration approximation in transient simulation

Fundamental theory
• Approximation of exponential operator
  – General circuit equation and solution
  – If we characterize the input as a simple ramp over the interval (tn, tn+1), the exact analytic solution with time step tn
  – Exponential operator approximation
    • Forward Euler
    • Backward Euler
    • Trapezoidal
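
The slide's formulas are not in this transcript. For a homogeneous system x' = -Ax advanced by a step h, the three named methods correspond to the standard approximations below; the symbols A and h are notation introduced here, and the input (ramp) term is left out.

```latex
\begin{align*}
\text{Forward Euler:}\quad  & e^{-hA} \approx I - hA\\
\text{Backward Euler:}\quad & e^{-hA} \approx (I + hA)^{-1}\\
\text{Trapezoidal:}\quad    & e^{-hA} \approx \left(I + \tfrac{h}{2}A\right)^{-1}\left(I - \tfrac{h}{2}A\right)
\end{align*}
```

The first two agree with the exponential through the linear term in h, and the trapezoidal form agrees through the quadratic term.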

Fundamental theory
• Decomposition of exponential operators (Masuo Suzuki, 1991, Physics)
  – Function
  – First order:
  – Second order:
  – Third order:
  – (2m-1)th and (2m)th order:
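
The formulas on this slide did not survive extraction. For orientation, the widely quoted low-order product formulas and Suzuki's recursive construction of even-order formulas are given below for two non-commuting operators A and B; they may differ in form from what the slide showed, and the third-order and odd-order cases are not reproduced here.

```latex
\begin{align*}
\text{First order:}\quad  & e^{x(A+B)} = e^{xA}\,e^{xB} + O(x^{2})\\
\text{Second order:}\quad & e^{x(A+B)} = S_{2}(x) + O(x^{3}),
   \qquad S_{2}(x) = e^{\frac{x}{2}A}\,e^{xB}\,e^{\frac{x}{2}A}\\
\text{$(2m)$th order:}\quad & S_{2m}(x) = \left[S_{2m-2}(p_{m}x)\right]^{2}
   S_{2m-2}\big((1-4p_{m})x\big)\left[S_{2m-2}(p_{m}x)\right]^{2},
   \qquad p_{m} = \frac{1}{4 - 4^{1/(2m-1)}}
\end{align*}
```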

Fundamental theory
• Decomposition of exponential operators

General Operator Splitting Formulation
• Transient simulation:
  – Apply the second order approximation
  – In each time step, every partition is calculated separately, and trapezoidal integration is used for every partition
  – The size of the left-hand-side matrix may change
  – The number of non-zero elements is definitely decreased
  – Can be easily extended to multi-way partitions
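
A multi-way version of this idea can be sketched as a symmetric sweep over the partitions with a trapezoidal sub-step for each one. The three-way split, the 2x2 matrices, and the function names below are assumptions made for illustration; the slide's own equations are not in the transcript.

```python
import math

# Hypothetical three-way partition of x' = -(A1 + A2 + A3) x (illustrative values).
PARTS = [
    [[1.0, -1.0], [-1.0, 1.0]],   # A1: coupling between the two nodes
    [[1.0, 0.0], [0.0, 0.0]],     # A2: node 1 to ground
    [[0.0, 0.0], [0.0, 1.0]],     # A3: node 2 to ground
]


def trapezoidal_substep(a, x, s):
    """Advance x' = -a x by s with the trapezoidal rule (2x2 system)."""
    # Right-hand side: (I - s/2 a) x
    r0 = x[0] - 0.5 * s * (a[0][0] * x[0] + a[0][1] * x[1])
    r1 = x[1] - 0.5 * s * (a[1][0] * x[0] + a[1][1] * x[1])
    # Left-hand-side matrix I + s/2 a, solved by Cramer's rule
    m00, m01 = 1.0 + 0.5 * s * a[0][0], 0.5 * s * a[0][1]
    m10, m11 = 0.5 * s * a[1][0], 1.0 + 0.5 * s * a[1][1]
    det = m00 * m11 - m01 * m10
    return [(r0 * m11 - m01 * r1) / det, (m00 * r1 - r0 * m10) / det]


def splitting_step(x, h):
    """Symmetric sweep: A1 and A2 for h/2, A3 for h, then A2 and A1 for h/2."""
    for a in PARTS[:-1]:
        x = trapezoidal_substep(a, x, h / 2)
    x = trapezoidal_substep(PARTS[-1], x, h)
    for a in reversed(PARTS[:-1]):
        x = trapezoidal_substep(a, x, h / 2)
    return x


def simulate(x0, h, steps):
    x = list(x0)
    for _ in range(steps):
        x = splitting_step(x, h)
    return x


def exact(t):
    # A1 + A2 + A3 = [[2, -1], [-1, 2]] with eigenvalues 1 and 3; x(0) = [1, 0].
    a, b = 0.5 * math.exp(-t), 0.5 * math.exp(-3.0 * t)
    return [a + b, a - b]


x_gos = simulate([1.0, 0.0], h=0.01, steps=100)
x_ref = exact(1.0)
print("splitting:", x_gos)
print("exact:    ", x_ref)
```

Adding a partition only adds entries to PARTS; the sweep itself does not change, which is the sense in which the scheme extends to multi-way partitions.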

General Operator Splitting Formulation
• Equations

Local Truncation Error (Cont.)
• LTE of general operator splitting method
  – Estimate solution
  – LTE estimation

Stability Discussion
• The trapezoidal integration method is unconditionally stable for a stable system.
• In our operator splitting method, the trapezoidal method is used for all the sub-systems, so the method is still unconditionally stable.

Experimental Results

                                       Circuit 1   Circuit 2   Circuit 3
  #Nodes                               10,000      40,000      90,000
  #Transistors                         0           0           0
  Period                               10 ns
  SPICE3  CPU time (sec)               77.8        485.3       3,061.1
  SPICE3  #steps                       115, 114 (two values in the slide text for three columns)
  GOS     CPU time (sec)               164.7       1011.6      3435.9
  GOS     #steps                       102, 102 (two values in the slide text for three columns)
  Comparison (GOS / SPICE3 CPU time)   2.1x        2x          1.1x

Voltage drop of Circuit 3 (power mesh with sinks)

Conclusions
• We investigate alternating direction implicit and general operator splitting integration methods for transistor-level circuit transient simulation.
• In both methods, the circuit is divided into several sub-circuits; the direct matrix solver remains efficient because each matrix is simplified.
• Both methods are second order accurate and unconditionally stable.
• Overhead:
  – Circuit partition
  – Each time step consists of many sub-steps, and each sub-step is an N-R iteration process
• Better for circuits with large linear networks

Distributed Computing
• Distributed Processors
  – Cluster
  – Supercomputer
  – Multi-Core Processors (Intel Dual/Quad-Core, IBM Cell, etc.)
• Standard
  – MPI
  – Partitioning
  – Matrix Solver
• Capabilities
  – Speed-up (10-100+)
  – Memory Capacity (10-100+)

Future Works
• ADI method
  – More experiments
• General operator splitting method
  – Design and implement multi-way circuit partition algorithm
  – Implement multi-way general operator splitting program
  – Derive LTE for general multi-way situation
  – More experiments
• Distributed Computing
  – MPI Standard
  – Distributed Partitioning, Matrix Solver