Codesigned On-Chip Logic Minimization

Codesigned On-Chip Logic Minimization

Roman Lysecky & Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction (On-chip Logic Minimization)

[Figure: system-on-chip containing an ARM7 processor with instruction and data caches (I$, D$), memory (MEM), DMA, and the on-chip minimizer with its own MEM]

  1. Initialize minimizer
  2. Execute minimizer
  3. Indicate completion

On-Chip Minimization Applications (IP Routing Table Reduction)

• IP routing table reduction
  • Routing tables of large network routers have over 30,000 entries
  • Fast IP routing lookup is difficult without using large hardware resources
• Ternary CAM (McAuley & Francis, 1993)
  • TCAM can be used to perform routing table lookup in a single cycle
  • Requires large resources and large power consumption
• Mask Extension (Liu, 2002)
  • Uses two-level logic minimization to reduce the size of the routing table
  • Good results, but did not consider off-chip communication

[Figure: incoming IP packet 138.23.16.9 is looked up in the routing table; the longest prefix match selects Port 7]

  Destination IP Prefix    Next hop
  138.23.16.x              Port 7
  138.23.x.x               Port 5
  125.x.x.x                Port 3
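
The longest-prefix-match lookup in the figure can be modeled in software roughly as follows. This is a minimal sketch, not code from the slides: the names `route_entry`, `slide_table`, `prefix_len`, and `lpm_lookup` are hypothetical, and each prefix is assumed to be stored as a value/mask pair in the style of a ternary CAM row (mask bits of 0 are the "x" positions). A real TCAM compares all rows in parallel; the loop here only models the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four octets into a 32-bit IPv4 address (a macro so it works in initializers). */
#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* One routing entry as a ternary row: bits where mask is 0 are "don't care". */
typedef struct {
    uint32_t value;
    uint32_t mask;
    int port;
} route_entry;

/* The three example prefixes from the slide. */
static const route_entry slide_table[] = {
    { IPV4(138, 23, 16, 0), IPV4(255, 255, 255, 0), 7 },
    { IPV4(138, 23, 0, 0),  IPV4(255, 255, 0, 0),   5 },
    { IPV4(125, 0, 0, 0),   IPV4(255, 0, 0, 0),     3 },
};

/* Number of specified (non-"x") bits in a mask. */
static int prefix_len(uint32_t mask) {
    int len = 0;
    while (mask != 0) {
        len += (int)(mask & 1u);
        mask >>= 1;
    }
    return len;
}

/* Return the port of the longest matching prefix, or -1 if nothing matches. */
static int lpm_lookup(const route_entry *table, size_t n, uint32_t addr) {
    int best_port = -1;
    int best_len = -1;
    for (size_t i = 0; i < n; i++) {
        /* A well-formed row has no value bits set under "don't care" positions. */
        assert((table[i].value & ~table[i].mask) == 0);
        if ((addr & table[i].mask) != table[i].value)
            continue;
        int len = prefix_len(table[i].mask);
        if (len > best_len) {
            best_len = len;
            best_port = table[i].port;
        }
    }
    return best_port;
}
```

With this model, `lpm_lookup(slide_table, 3, IPV4(138, 23, 16, 9))` selects Port 7, because the 24-bit prefix wins over the 16-bit one. Fewer rows after minimization means fewer comparisons in software and fewer TCAM rows in hardware.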

On-Chip Minimization Applications (Access Control List Reduction)

• Access Control List (ACL)
  • Used to restrict IP traffic through network routers
  • ACL size can range from 300 entries (UCR CS&E Dept.) to 10,000 entries (AOL)
  • Common use is to block a particular protocol or port number to avoid attacks such as denial-of-service attacks
• ACL Minimization
  • Similar approach to the one used for IP routing table reduction
  • However, the order of the list must be preserved

ACL input format:

  Type | Protocol | In IP | In Port | Out IP | Out Port | Action
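
Why list order matters can be seen in a small model of ACL evaluation. The sketch below assumes the common first-match convention (the first rule that matches a packet decides the action) and an implicit deny when nothing matches; neither is stated on the slide. The rule layout is reduced to two hypothetical fields, `protocol` and `out_port`, rather than the full input format.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ACL_DENY = 0, ACL_PERMIT = 1 } acl_action;

/* Simplified, hypothetical rule: -1 in a field means "any". */
typedef struct {
    int protocol;
    int out_port;
    acl_action action;
} acl_rule;

/* First-match evaluation: the earliest matching rule decides. */
static acl_action acl_eval(const acl_rule *rules, size_t n, int protocol, int out_port) {
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int proto_ok = rules[i].protocol < 0 || rules[i].protocol == protocol;
        int port_ok = rules[i].out_port < 0 || rules[i].out_port == out_port;
        if (proto_ok && port_ok)
            return rules[i].action;
    }
    return ACL_DENY; /* assumption: implicit deny when no rule matches */
}

/* The same two rules in both orders: block TCP (6) port 23, permit other TCP. */
static const acl_rule deny_first[] = {
    { 6, 23, ACL_DENY },
    { 6, -1, ACL_PERMIT },
};
static const acl_rule permit_first[] = {
    { 6, -1, ACL_PERMIT },
    { 6, 23, ACL_DENY },
};
```

A TCP packet to port 23 is denied by `deny_first` but permitted by `permit_first`, so a minimizer that merges or reorders overlapping rules freely would change the filter's behavior.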

On-Chip Minimization Applications (Dynamic Hardware/Software Partitioning)

• Dynamic hardware/software partitioning (JIT compilation for FPGAs)
  • Dynamically detects frequently executed loops and reimplements the software loops using on-chip configurable logic
  • Requires logic synthesis tools to be embedded on-chip

[Figure: warp processor containing a MIPS/ARM core with I$ and D$, a profiler, a dynamic partitioning module, and configurable logic]

ROCM

• On-chip Logic Minimization Requirements
  • Limited data and instruction memory available
  • Quality of results must still be close to optimal
  • Execution time should remain reasonable
• On-chip Logic Minimization Goal
  • Focus on developing an on-chip logic minimization tool that produces acceptable results with reasonable increases in execution time while using limited memory resources
• ROCM: Riverside On-Chip Minimizer
  • Two-level minimization tool
  • Utilizes a combination of approaches from Espresso-II (Brayton et al., 1984) and Presto (Svoboda & White, 1979)
  • Eliminates the need to compute the off-set, to reduce memory usage
  • Utilizes a single expand phase instead of multiple iterations
  • On average only 2% larger than the optimal solution

ROCM Results (Performance/Memory Usage)

• ROCM executing on a 40 MHz ARM7 requires less than 1 second
• Small code size of only 22 kilobytes
• Average data memory usage of only 1 megabyte

[Chart legend: 500 MHz Sun Ultra 60; 40 MHz ARM7 (Triscend A7)]

Codesign ROCM (Hardware Coprocessor)

• Customized ROCM enables us to develop an efficient hardware coprocessor
• Profiled the execution of ROCM-32 and ROCM-128 using the ARM port of the SimpleScalar simulator
• Determined critical loops/functions that are suitable for implementation in hardware
• Identified six critical kernels that comprised 91% of the total execution time but only 2% of the code size
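
As a rough sanity check on what moving these kernels to hardware can buy, Amdahl's law gives an upper bound. This is a standard result applied here for illustration, not a calculation from the slides; $p$ is the profiled kernel fraction and $s$ is a hypothetical speedup of the kernels themselves.

```latex
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad
p = 0.91 \;\Rightarrow\; \lim_{s \to \infty} S(s) = \frac{1}{1 - 0.91} = \frac{1}{0.09} \approx 11.1
```

The reported average speedup of 7.8 sits below this bound, as expected. The comparison is only approximate, since the software was restructured before partitioning and the 91% figure was measured on the profiled version.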

Codesign ROCM (Minimization Coprocessor)

[Figure: on-chip minimizer built from the ARM7, memory (MEM), and a minimization coprocessor attached through a processor/memory interface (data, addr). Coprocessor kernels: IsCov, Tautology1, SetLit, GetLit, DoesInter, Cofactor1]

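Codesign ROCM (Minimization Coprocessor)

[Figure: detail of the DoesIntersect kernel inside the minimization coprocessor. Labels: aImpl and dImpl (64 bits each), data, addr, numLits (5), shift left by 1, 32 (odd) and 32 (even) bit groups, Intersect, comparison with 0, retVal. The other kernels (IsCov, SetLit, GetLit, Tautology1, DoesInter, Cofactor1) sit behind the same processor/memory interface]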
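
The figure suggests that an implicant fits in a 64-bit word as 32 two-bit literals and that intersection reduces to bitwise logic plus a compare against zero. A minimal software sketch of that idea follows. It assumes the usual positional-cube encoding (01 = complemented, 10 = true, 11 = don't care, 00 = empty), which the slides do not spell out, and the functions `set_lit`, `get_lit`, and `does_intersect` are illustrative stand-ins, not the authors' SetLit, GetLit, or DoesIntersect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 2-bit literal codes (positional-cube notation). */
#define LIT_ZERO 1u /* variable appears complemented */
#define LIT_ONE  2u /* variable appears uncomplemented */
#define LIT_DC   3u /* don't care */

/* Mask selecting the even (low) bit of every 2-bit literal. */
#define EVEN_BITS UINT64_C(0x5555555555555555)

/* Write a literal code at position pos (0..31). */
static uint64_t set_lit(uint64_t impl, unsigned pos, unsigned code) {
    assert(pos < 32 && code <= 3);
    impl &= ~(UINT64_C(3) << (2 * pos));
    return impl | ((uint64_t)code << (2 * pos));
}

/* Read the literal code at position pos (0..31). */
static unsigned get_lit(uint64_t impl, unsigned pos) {
    assert(pos < 32);
    return (unsigned)((impl >> (2 * pos)) & 3u);
}

/* Two implicants intersect iff no literal of their bitwise AND is 00. */
static bool does_intersect(uint64_t a, uint64_t b, unsigned num_lits) {
    assert(num_lits >= 1 && num_lits <= 32);
    uint64_t both = a & b;
    /* Fold each literal's odd bit onto its even bit: set iff the literal is non-empty. */
    uint64_t nonempty = (both | (both >> 1)) & EVEN_BITS;
    /* Only the first num_lits literals take part in the check. */
    uint64_t used = (num_lits == 32)
                        ? EVEN_BITS
                        : (EVEN_BITS & ((UINT64_C(1) << (2 * num_lits)) - 1));
    return (~nonempty & used) == 0;
}
```

For example, with two literals, `x0` intersects `x0 x1'` but not `x0'`. The whole test is a handful of word-wide operations with no loop over literals, which is the kind of computation that maps well to a small datapath.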

Codesign ROCM Results (Execution Time)

• Average speedup of 7.8

Codesign ROCM Results (Energy Consumption)

• Average energy reduction of 59.2%

Codesign ROCM (Minimization Coprocessor)

• Software modifications were required to achieve the speedup of 7.8
  • Original data structures/algorithms were not suitable for hardware implementation
  • Reorganized data structures
  • Customized width of data items
  • Eliminated memory allocation within critical regions
• Not automated with current hardware/software partitioning tools

Codesign ROCM (Minimization Coprocessor)

Original C Code

    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj))
            continue;
        for (k = 0; k < xj->numLiterals; k++) {
            // determine coImplicant
            ...
        }
        AddImplicant(cofactor, &coImplicant);  // requires dynamic memory allocation
    }

Slide annotations: "Move to HW"; "28.5% of total exec. time"; "Only 3.5% of total exec. time"

Codesign ROCM (Minimization Coprocessor)

Modified C Code

    // determine size of cofactor initially
    cofactorSize = 0;
    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj))
            continue;
        cofactorSize++;
    }

    // allocate all memory outside of main loop
    cofactor->implicants = malloc(…);

    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj))
            continue;
        // additional initialization code needed for each iteration
        coImplicant = &(cofactor->implicants[index++]);
        for (k = 0; k < xj->numLiterals; k++) {
            ...
        }
    }
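
The count-then-allocate pattern above can be tried in isolation with a self-contained sketch. Everything below is illustrative rather than the authors' code: implicants are assumed to be 64-bit positional cubes (2 bits per literal, unused literals padded with don't cares), the `cover` struct and function names are hypothetical, and the cofactor of an intersecting implicant `c` against `xj` is computed as `c | ~xj`, the standard positional-cube formulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EVEN_BITS UINT64_C(0x5555555555555555)
#define ALL_DC    UINT64_C(0xFFFFFFFFFFFFFFFF) /* every literal is "don't care" */

/* Hypothetical cover: a flat array of 64-bit implicants. */
typedef struct {
    size_t numImplicants;
    uint64_t *implicants;
} cover;

/* Implicants intersect iff no 2-bit literal of their AND is 00. */
static bool does_intersect(uint64_t a, uint64_t b) {
    uint64_t both = a & b;
    return ((both | (both >> 1)) & EVEN_BITS) == EVEN_BITS;
}

/* Cofactor of F against xj. Returns 0 on success, -1 if allocation fails. */
static int cofactor_cover(const cover *F, uint64_t xj, cover *out) {
    size_t i, count = 0, index = 0;

    /* Pass 1: determine the size of the cofactor. */
    for (i = 0; i < F->numImplicants; i++)
        if (does_intersect(F->implicants[i], xj))
            count++;

    /* Allocate once, outside the main loop. */
    out->numImplicants = count;
    out->implicants = NULL;
    if (count == 0)
        return 0;
    out->implicants = malloc(count * sizeof *out->implicants);
    if (out->implicants == NULL)
        return -1;

    /* Pass 2: the critical loop now contains no allocation. */
    for (i = 0; i < F->numImplicants; i++) {
        if (!does_intersect(F->implicants[i], xj))
            continue;
        out->implicants[index++] = F->implicants[i] | ~xj;
    }
    assert(index == count);
    return 0;
}

/* Usage example: F = { x0, x0' x1 } cofactored against xj = x0. */
static uint64_t demo_cofactor_first(void) {
    uint64_t on_set[2] = {
        ALL_DC & ~UINT64_C(1),                     /* x0:     literal 0 = 10 */
        (ALL_DC & ~UINT64_C(0xF)) | UINT64_C(0x9), /* x0' x1: literals = 10 01 */
    };
    cover F = { 2, on_set };
    cover out;
    uint64_t first;
    int rc = cofactor_cover(&F, ALL_DC & ~UINT64_C(1), &out);
    assert(rc == 0 && out.numImplicants == 1);
    (void)rc;
    first = out.implicants[0];
    free(out.implicants);
    return first;
}
```

The trade-off is that the intersection test runs twice per implicant, in exchange for a main loop that is free of allocator calls and therefore easier to move into hardware.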

Conclusions & Future Work

• Developed codesigned on-chip logic minimization
  • Performance improvement of nearly 8X compared to the earlier software-only implementation
  • Energy reduction of almost 60%
• New directions in hardware/software partitioning
  • Designer effort was required to rewrite algorithms and fine-tune data structures
  • Could better hardware/software partitioning tools automate this?