Codesign Extended Applications
Brian Grattan, Greg Stitt, Frank Vahid*
Dept. of Computer Science & Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation and by NEC C&C Research Labs.

Outline
- Introduction: Hardware/Software Partitioning
  - And the common assumption of a single specification
- Different Algorithms in Hardware/Software
- Codesign Extended Applications
- Experiments
- Future Work and Conclusions
CODES’ 02 – Codesign Extended Applications Brian Grattan, Greg Stitt, Frank Vahid, Univ. of California, Riverside

Introduction – Hw/Sw Partitioning
- Hw/sw partitioning can speed up software
  - Shown by numerous researchers
    - E.g., Balboni, Fornaciari, Sciuto CODES’96; Eles, Peng, Kuchcinski, Doboli DAES’97; Gajski, Vahid, Narayan, Gong Prentice-Hall 1997; Grode, Knudsen, Madsen DATE’98; many others
  - 1.5 to 10x common
  - Some examples like image processing get 100-800x speedup
    - E.g., Cameron project, FCCM’02
- Can reduce energy too
  - E.g.
    - Henkel, Li CODES’98
    - Wan, Ichikawa, Lidsky, Rabaey CICC’98
    - Stitt, Grattan, Villarreal, Vahid FCCM’02
      - 60-80% energy savings measured on real single-chip uP/FPGA devices

Hw/Sw Partitioning on Single-Chip Platforms
- Numerous single-chip commercial devices with uP and FPGA
  - Triscend E5 (shown)
  - Triscend A7
  - Atmel FPSLIC
  - Xilinx Virtex II Pro
  - Altera Excalibur
  - More sure to come…
- Make hw/sw partitioning even more attractive
[Figure: Triscend E5 chip with labeled regions: configurable logic, uP and peripherals, cache/memory]

Hw/Sw Partitioning – Commercial Tools Evolving
- Commercial products evolving
  - Synopsys’ Nimble compiler (2000) attempt
  - Proceler
    - Microprocessor Report’s 2001 Technology of the Year Award
  - Others coming…

Hw/Sw Partitioning – Single-Spec Assumption
- Assumption – Start from a single specification
  - Typically sw source
- Partitioning
  - Find critical sw kernels, map some to hw
- This assumption is made in most research efforts as well as commercial tools
[Figure: Specification -> Hw/sw partitioner -> Sw (Compilation -> Binaries) and Hw (Synthesis -> Netlists)]

Digital Camera Example
- Developed with intent of exploring hw/sw tradeoffs
  - Captures images, compresses, uploads to PC
- Soon found that a single specification wasn’t reasonable
  - Two key functions had different hw/sw algorithms
    - CRC
    - DCT
[Figure: camera block diagram with labels: DCT, Huffman encoder, Encoder, Controller, CCD Pre-Processor, Communications, CRC calculation]

Digital Camera Example
- Partitioning from the single specification results in a weak hw design
  - We would have written CRC and DCT differently had we known they’d be mapped to hw
  - Yet, we’d keep the original algorithms if they ended up in software
[Figure: Spec (DCT, Huffman, CRC, CCD, Ctrl) -> Hw/sw partitioner -> Sw: Huff., CCD, Ctrl (Compilation -> Binaries) and Hw: CRC, DCT (Synthesis -> Netlists, marked "Weak")]

Different Algorithms in Hw vs. Sw
- The single-specification assumption doesn’t always hold
- Key observation
  - Designers often use very different algorithms if a behavior is mapped to hardware versus if that behavior is mapped to software
- Widely known by designers
  - In textbooks
  - Also known in parallel processing: sequential and parallel algorithms

Different Algorithms – Sorting Example
- Suppose desired behavior fills a buffer, sorts the buffer, and transmits the sorted list
- Sort() in software – Quicksort
  - Simple and fast in sw
  - Poor in hw, can’t be parallelized well
- Sort() in hardware – Parallel Mergesort
  - Very fast in hardware
  - Slow in sw (if sequential) due to overhead
- Derive one from the other?
[Figure: Fill() -> Sort() -> Transmit(); Sort() drawn as Quicksort for sw and as a tree of mergesort (MS) units for hw]

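The contrast between the two Sort() versions can be sketched in C. This is a minimal illustration, not the authors' code: the names sort_sw, sort_hw_model, and merge_runs are hypothetical, the software version simply calls the C library's qsort (often, but not necessarily, a quicksort), and the hardware version is only modeled sequentially as a bottom-up mergesort whose merges within one pass touch disjoint ranges and so could each become a concurrent merge unit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: ascending order of ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Software-oriented version (hypothetical name): sequential library sort. */
void sort_sw(int *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    qsort(buf, n, sizeof buf[0], cmp_int);
}

/* Merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid)
        dst[k++] = src[i++];
    while (j < hi)
        dst[k++] = src[j++];
}

/* Hardware-oriented version (hypothetical name), modeled sequentially.
 * Every merge inside one pass works on its own range, so a hw design
 * could run them all at once. Returns 0 on success, -1 if the scratch
 * buffer cannot be allocated. */
int sort_hw_model(int *buf, size_t n)
{
    if (n < 2)
        return 0;
    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    int *src = buf, *dst = tmp;
    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {   /* independent merges */
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *t = src;   /* swap buffer roles for the next pass */
        src = dst;
        dst = t;
    }
    if (src != buf)
        memcpy(buf, src, n * sizeof *buf);
    free(tmp);
    return 0;
}
```

Run sequentially, the mergesort model pays for the extra buffer and copying, which is the overhead the slide refers to; the payoff only appears when the merges of a pass execute in parallel.
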
Different Algorithms – CRC Example
- CRC – Cyclic Redundancy Check
  - Used for error checking during communication, stronger than parity
  - Mathematically, divides a constant into the data and saves the remainder
[Figure: Main function calls crc() with parameters:
    init_crc - initial value
    *data - pointer to data
    len - length of data
    jinit - initializing options
  crc() returns: value of CRC for given data]

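The "divide and keep the remainder" description can be made concrete with a bit-at-a-time sketch. It assumes the 16-bit CCITT generator x^16 + x^12 + x^5 + 1 (0x1021), which the slides do not name explicitly; the function name crc16_bitwise and the fixed-width types are illustrative choices, and the jinit option from the figure is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed generator: CCITT polynomial x^16 + x^12 + x^5 + 1. */
#define CRC16_POLY 0x1021u

/* Long division over GF(2): shift the data through a 16-bit remainder
 * one bit at a time, subtracting (XORing) the generator whenever the
 * top bit is set. What is left at the end is the CRC. */
uint16_t crc16_bitwise(uint16_t init_crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);          /* bring in the next byte */
        for (int b = 0; b < 8; b++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ CRC16_POLY);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

With an initial value of 0 this is the variant commonly catalogued as CRC-16/XMODEM, whose published check value for the ASCII string "123456789" is 0x31C3; with an initial value of 0xFFFF the check value is 0x29B1.
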
Different Algorithms – CRC in Hardware
- Hardware Version
  - Knowing the generator polynomial, one can calculate the XORs for each individual bit
  - Each CRC value is the result of bit-wise XORs with the data and the previous CRC value
  - Synthesizes to hw very nicely; but getting bits and shifting are inefficient in sw

    unsigned short crc_hw(…)
    {
        unsigned short j, crc_value = init_crc;
        unsigned short new_crc_value;

        if (jinit >= 0)
            crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
        for (j = 1; j <= len; j++) {
            new_crc_value = bit(4, data[j]) ^ bit(0, data[j]) ^ bit(8, crc_value) ^ bit(12, crc_value);   // bit 0
            new_crc_value = new_crc_value | (bit(5, data[j]) ^ bit(1, data[j]) ^ bit(9, crc_value) ^ bit(13, crc_value)) << 1;    // bit 1
            new_crc_value = new_crc_value | (bit(6, data[j]) ^ bit(2, data[j]) ^ bit(10, crc_value) ^ bit(14, crc_value)) << 2;   // bit 2
            /* … continue for bits 3 through 7 … */
            crc_value = new_crc_value;   // the next character starts from this CRC
        }
        return (crc_value);
    }

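The per-bit XOR idea can be written out in full under one assumption: that the generator is the CCITT polynomial x^16 + x^12 + x^5 + 1, for which the byte-parallel equations for bits 0 to 2 are exactly the ones shown on the slide. The remaining thirteen equations below are the standard ones for that polynomial, not taken from the slides, and the names crc16_next_d8 and crc16_hw_style are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tap bit n of v; in hardware this is just a wire. */
static unsigned bit(unsigned n, unsigned v)
{
    return (v >> n) & 1u;
}

/* One character step: each output bit is an XOR of fixed data bits (d)
 * and previous-CRC bits (c). Assumes the CCITT generator 0x1021. */
uint16_t crc16_next_d8(uint8_t d, uint16_t c)
{
    unsigned n = 0;
    n |= (bit(4, d) ^ bit(0, d) ^ bit(8, c) ^ bit(12, c)) << 0;
    n |= (bit(5, d) ^ bit(1, d) ^ bit(9, c) ^ bit(13, c)) << 1;
    n |= (bit(6, d) ^ bit(2, d) ^ bit(10, c) ^ bit(14, c)) << 2;
    n |= (bit(7, d) ^ bit(3, d) ^ bit(11, c) ^ bit(15, c)) << 3;
    n |= (bit(4, d) ^ bit(12, c)) << 4;
    n |= (bit(5, d) ^ bit(4, d) ^ bit(0, d) ^ bit(8, c) ^ bit(12, c) ^ bit(13, c)) << 5;
    n |= (bit(6, d) ^ bit(5, d) ^ bit(1, d) ^ bit(9, c) ^ bit(13, c) ^ bit(14, c)) << 6;
    n |= (bit(7, d) ^ bit(6, d) ^ bit(2, d) ^ bit(10, c) ^ bit(14, c) ^ bit(15, c)) << 7;
    n |= (bit(7, d) ^ bit(3, d) ^ bit(0, c) ^ bit(11, c) ^ bit(15, c)) << 8;
    n |= (bit(4, d) ^ bit(1, c) ^ bit(12, c)) << 9;
    n |= (bit(5, d) ^ bit(2, c) ^ bit(13, c)) << 10;
    n |= (bit(6, d) ^ bit(3, c) ^ bit(14, c)) << 11;
    n |= (bit(7, d) ^ bit(4, d) ^ bit(0, d) ^ bit(4, c) ^ bit(8, c) ^ bit(12, c) ^ bit(15, c)) << 12;
    n |= (bit(5, d) ^ bit(1, d) ^ bit(5, c) ^ bit(9, c) ^ bit(13, c)) << 13;
    n |= (bit(6, d) ^ bit(2, d) ^ bit(6, c) ^ bit(10, c) ^ bit(14, c)) << 14;
    n |= (bit(7, d) ^ bit(3, d) ^ bit(7, c) ^ bit(11, c) ^ bit(15, c)) << 15;
    return (uint16_t)n;
}

/* Apply the step to a whole buffer, one character per "clock". */
uint16_t crc16_hw_style(uint16_t init_crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++)
        crc = crc16_next_d8(data[i], crc);
    return crc;
}
```

In hardware all sixteen XOR trees evaluate at once, so a character costs one clock; in software each bit() call is a shift and a mask, which is why this form is slow on a microprocessor.
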
Different Algorithms – CRC in Software
- Software Version
  - Before doing any calculations, create an initialization table that holds the CRC for each individual character
  - Use the data, XORed with the high CRC byte, as index into the table; two XORs per character
  - Requires lookups, but faster for a sequential calculation

    unsigned short crc_sw(…)   // Source: Numerical Recipes in C
    {
        unsigned short initialize_table(unsigned short crc, unsigned char one_char);
        static unsigned short icrctb[256];
        static int init = 0;   // the table is built on the first call only
        unsigned short tmp1, j, crc_value = init_crc;

        if (!init) {
            init = 1;
            for (j = 0; j <= 255; j++)
                icrctb[j] = initialize_table(j << 8, (uchar) 0);
        }
        if (jinit >= 0)
            crc_value = ((uchar) jinit) | (((uchar) jinit) << 8);
        for (j = 1; j <= len; j++) {
            tmp1 = data[j] ^ HIBYTE(crc_value);
            crc_value = icrctb[tmp1] ^ (LOBYTE(crc_value) << 8);
        }
        return (crc_value);
    }

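A self-contained version of the table-driven approach might look like the following. It is a sketch in the spirit of the Numerical Recipes routine rather than a copy of it: it assumes the CCITT generator 0x1021, uses 0-based indexing, omits the jinit option, and the names crc16_table, crc16_one_byte, and crc16_init_table are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t crc_table[256];   /* CRC of every possible single byte */
static int table_ready = 0;

/* Bit-at-a-time CRC of one byte; only used to fill the table.
 * Assumes the CCITT generator x^16 + x^12 + x^5 + 1. */
static uint16_t crc16_one_byte(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)(byte << 8);
    for (int b = 0; b < 8; b++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ 0x1021u);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}

/* One-time setup cost: 256 entries, each computed once. */
static void crc16_init_table(void)
{
    for (unsigned j = 0; j < 256; j++)
        crc_table[j] = crc16_one_byte(0, (uint8_t)j);
    table_ready = 1;
}

/* Per character: one XOR to form the index, one lookup, one XOR to
 * combine the table entry with the shifted low byte of the old CRC. */
uint16_t crc16_table(uint16_t init_crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (!table_ready)
        crc16_init_table();
    uint16_t crc = init_crc;
    for (size_t i = 0; i < len; i++) {
        uint8_t idx = (uint8_t)(data[i] ^ (crc >> 8));
        crc = (uint16_t)(crc_table[idx] ^ (uint16_t)(crc << 8));
    }
    return crc;
}
```

The 256-entry table is cheap in a microprocessor's memory but becomes a ROM or a large block of logic when synthesized, which fits the observation that this version is the better choice only in software.
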
Different Algorithms – DCT
- DCT – Discrete Cosine Transform
  - Computationally intensive, numerous matrix multiplies
  - Accounts for perhaps 70% of JPEG encoding time
- Dozens of possible algorithms
  - Best algorithm depends largely on computational resources
  - Certainly different for sw and hw
- Doing multiplications in floating-point vs. fixed-point
  - Multiplication by a constant can be efficiently mapped to hardware, but accuracy will be lost by not using floating-point

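The floating-point versus fixed-point tradeoff can be seen on a single constant multiplication of the kind a DCT performs many times. The sketch below is illustrative only: the constant cos(pi/4), the 13 fractional bits, and the names scale_float and scale_fixed are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS    13                 /* assumed fixed-point precision (Q13) */
#define ONE          (1 << FRAC_BITS)   /* the value 1.0 in Q13 */
#define COS_PI_4_F   0.70710678f        /* cos(pi/4) as a float */
#define COS_PI_4_Q13 5793               /* round(0.70710678 * 8192) */

/* Software-style version: exact to float precision, needs an FPU
 * or a floating-point library. */
float scale_float(float x)
{
    return x * COS_PI_4_F;
}

/* Hardware-style version: an integer multiply by a constant followed by
 * a rescale, rounded to nearest. The fractional part of the result is
 * discarded, which is the accuracy loss. */
int32_t scale_fixed(int32_t x)
{
    assert(x > -262144 && x < 262144);   /* keeps the product inside 32 bits */
    int32_t p = x * COS_PI_4_Q13;
    int32_t half = ONE / 2;
    return (p >= 0 ? p + half : p - half) / ONE;
}
```

For an input of 1000, scale_fixed returns 707 while the floating-point product is about 707.107. In hardware the multiply by 5793 can reduce to a few shifts and adds, and the division by a power of two is just a choice of which wires to keep.
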
Codesign Extended Applications (CEAs)
- Basic idea:
  - Write two versions of certain functions
    - Only the critical functions, and
    - Only those with different sw and hw algorithms
  - Typically only a handful of these
    - Most time is spent in just a few critical functions
  - Include both function versions in the specification
    - But use compiler flags to include either sw or hw version

    main() { … crc(); … }

    char crc(…)
    {
    #ifdef cea_crc_hw
        crc_hw(…);
    #else
        crc_sw(…);
    #endif
    }

    % gcc -Dcea_crc_hw main.c

CEAs when using C/C++ and VHDL code
C code:

    crc_hw(…inputs…)
    /* Hardware crc. . . */
    for (j = 1; j <= len; j++) {
        TSHORT(to_hw) = data[j];   // hand the next character to the hw
        TBYTE(enable) = 1;         // pulse enable so the hw consumes it once
        TBYTE(enable) = 0;
    }
    crc_value = TSHORT(result);
    return (crc_value);

VHDL code:

    if (rst = '1') then
        crc <= "00000000";
        done <= '0';
    elsif (clk'event and clk = '1') then
        if (enable = '1') then
            if done = '0' then
                crc <= nextCRC16_D8(input, crc);
                done <= '1';
            end if;
        else
            done <= '0';
            output <= crc;
        end if;
    end if;

CEAs Enable Hw/Sw Partitioning Tool
- Traditional hw/sw partitioner
  - Compiler, estimators, search heuristics, technology files, etc.
  - Drawback: heavy impact on tool flow
- CEAs plus platforms result in simple partitioner
  - Script uses existing compiler, synthesis, and evaluation (simulation or physical measurement)
  - Drawbacks: must write two versions of critical functions, script may use simpler search function
  - Different partitioners for different domains
[Figure, traditional flow: Specification -> Hw/sw partitioner (essentially a compiler, search heuristic, and estimator; heavy-duty tool) -> Sw (Compilation -> Binaries) and Hw (Synthesis -> Netlists)]
[Figure, CEA flow: CEA -> Script (search heuristic and tool control; lightweight tool) -> Sw (Compilation -> Binaries) and Hw (Synthesis -> Netlists), plus an Evaluator]

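The lightweight script amounts to two steps: turn a candidate partition into a set of compiler flags, then pick the best partition from the measured results. The sketch below shows both under stated assumptions: the function list, the cea_dct_hw macro name, the exhaustive search, and the names build_command and best_partition are all hypothetical (only cea_crc_hw and the gcc -D usage appear in the slides), and the synthesis and measurement steps are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical list of functions that have both a sw and a hw version. */
#define NUM_FUNCS 2u
static const char *const cea_funcs[NUM_FUNCS] = { "crc", "dct" };

/* Build the compile command for one partition. Bit i of mask set means
 * function i uses its hw version. Returns 0 on success, -1 if the
 * output buffer is too small. */
int build_command(unsigned mask, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    int n = snprintf(out, cap, "gcc");
    if (n < 0 || (size_t)n >= cap)
        return -1;
    size_t used = (size_t)n;
    for (unsigned i = 0; i < NUM_FUNCS; i++) {
        if (!(mask & (1u << i)))
            continue;
        n = snprintf(out + used, cap - used, " -Dcea_%s_hw", cea_funcs[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    n = snprintf(out + used, cap - used, " main.c");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return 0;
}

/* Exhaustive search over measured results. cost[mask] is the measured
 * figure (for example energy) for that partition; a negative entry marks
 * a partition that could not be built. Returns the best mask, or count
 * if no partition is feasible. */
unsigned best_partition(const double *cost, unsigned count)
{
    assert(cost != NULL && count > 0);
    unsigned best = count;
    for (unsigned mask = 0; mask < count; mask++) {
        if (cost[mask] < 0.0)
            continue;
        if (best == count || cost[mask] < cost[best])
            best = mask;
    }
    return best;
}
```

With the two functions above, mask 0 yields "gcc main.c" (everything in sw) and mask 3 yields "gcc -Dcea_crc_hw -Dcea_dct_hw main.c". Exhaustive search is workable only because a CEA has a handful of dual-version functions; a larger set would call for a real search heuristic.
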
Experiments
- Compared hw and sw CRC algorithms
  - Synthesized to FPGA
  - Compiled to MIPS uP
- Demonstrates need for different algorithms: each algorithm does best on the target it was written for

Sw and hw CRC algorithms in FPGA.
                          Size (Blocks)    Delay (clock cycles/character)
  Hardware CRC algorithm  19               1
  Software CRC algorithm  44               3

Sw and hw CRC algorithms on a microprocessor.
                          Size (Assembly Lines)    Clock Cycles
  Software CRC algorithm  1061                     180,000
  Hardware CRC algorithm  1298                     814,000

In the FPGA the hardware algorithm is less than half the size and 3x faster per character; on the microprocessor it takes about 4.5x as many cycles as the software algorithm.

Experiments
- Wrote small signal processing example as CEA
  - Wrote sw and hw versions of core functions
    - In this case, algorithms were similar
- Set up power measurement for two real platforms
  - XS40 (board with microcontroller chip and Xilinx FPGA chip)
  - E5 (single chip with microcontroller and FPGA)
- Partitioning script automatically partitioned and measured power and cycles (overnight, due to place & route time)
  - Demonstrates how CEAs enable simple yet practical hw/sw partitioning
  - Easily migrates to different platforms, different chips
[Table: Energy (Joules) on E5 device for SW/HW partitionings of the Sum, Multiply, and Bit-Share functions. Reported values: 12.4, 8.6, 8.8, 8.0, 4.8, and one partitioning marked "Does not Route".]

Issues and Future Work
- Issues
  - What if hw versions not used after partitioning? Wasted effort?
  - Verification of all possible combinations?
  - Must use wisely or problem grows unwieldy
- Future work
  - More examples, more platforms
  - Several versions of the same function
    - One hardware area-conscious
    - One hardware speed-conscious
    - One software code-size-conscious
    - One software speed-conscious
    - …more…
  - Experimenting with communication between hardware and software
    - DMA transfer, wide-access memories, …

Conclusions
- Basic hw/sw partitioning assumption of a single specification doesn’t always hold
- Codesign Extended Applications help support different algorithms
- CEAs enable hw/sw partitioning in existing tool flows
  - Utilizes existing compilation, synthesis, mapping, evaluation tools, and platforms
  - Simple yet effective approach to hw/sw partitioning