COMPARISON OF ALTERA NIOS II PROCESSOR WITH ANALOG

  • Slides: 28
Download presentation
COMPARISON OF ALTERA NIOS II PROCESSOR WITH ANALOG DEVICE’S TIGERSHARC

COMPARISON OF ALTERA NIOS II PROCESSOR WITH ANALOG DEVICE’S TIGERSHARC

Outline What is a “Soft” Processor What is the NIOS II? Architecture for NIOS

Outline What is a “Soft” Processor What is the NIOS II? Architecture for NIOS II, what are the implications • Tiger. SHARC VS. NIOS II • Pipeline Issues • Issues related to FIR Hardware acceleration, using FPGA logic

What’s is a “Soft” Processor? Processor implemented in VHDL, Verilog, etc. , and downloaded

What’s is a “Soft” Processor? Processor implemented in VHDL, Verilog, etc. , and downloaded onto FPGA hardware Can implement many parallel processors on one FPGA Can use addition FPGA resources on the same chip that is not part of the processor core. NIOS II is a “Soft” Processor

Why “Soft” Processor? Higher level of design reuse Reduced obsolescence risk Simplified design update

Why “Soft” Processor? Higher level of design reuse Reduced obsolescence risk Simplified design update or change Increased design implementation options Lower latency between processor and FPGA components

What is NIOS II? Software-defined processor The processor core is loaded onto FPGA Programmed

What is NIOS II? Software-defined processor The processor core is loaded onto FPGA Programmed using ‘normal’ programming tools (C, asm), not hardware description languages Can use the rest of the FPGA hardware for accelerating parts of the code

How Is NIOS II Implemented The custom FPGA logic that interacts with the processor

How Is NIOS II Implemented The custom FPGA logic that interacts with the processor is implemented in Altera Quartus II The Avalon Interface bus (common instruction/data bus) is implemented in Quartus II The architecture is generated in Quartus II and used for programming in Eclipse IDE

NIOS II IDE Coding is implemented in Eclipse rather than Visual. DSP.

NIOS II IDE Coding is implemented in Eclipse rather than Visual. DSP.

The Different NIOS II Cores There are 3 cores available from Altera NIOSII/e: Economical

The Different NIOS II Cores There are 3 cores available from Altera NIOSII/e: Economical Core NIOSII/s: Standard Core NIOSII/f: Fast Core

What’s the Difference between the Cores? An LE is equivalent to a 8 -1

What’s the Difference between the Cores? An LE is equivalent to a 8 -1 NAND gate + 1 D-Flip Flop An ALM is equivalent to 2 LE’s

Comparison of Tiger. SHARC and NIOS II architecture

Comparison of Tiger. SHARC and NIOS II architecture

Tiger. SHARC Architecture

Tiger. SHARC Architecture

NIOS II Architecture -thirty two 32 -bit general registers, six 32 -bit control registers

NIOS II Architecture -thirty two 32 -bit general registers, six 32 -bit control registers -variable cache based on how much FPGA space you have -ALU- 32 bit two input to one input, does shifts, logic and arithmetic. Shifter is not separate like Tiger. SHARC

Avalon Interface -separate address, data and control lines -up to 1024 -bit data width

Avalon Interface -separate address, data and control lines -up to 1024 -bit data width transfer, can be set to any width (not power of 2) -one transfer per clock cycle.

NIOS II/f pipeline Six stages One instruction can be dispatched and/or retired pre cycle

NIOS II/f pipeline Six stages One instruction can be dispatched and/or retired pre cycle Dynamic branch prediction: 2 -bit branch history table (no BTB like in Tiger. SHARC)

NIOS II/f pipeline The pipeline stalls for: Multi-cycle instructions • Cache misses • Data

NIOS II/f pipeline The pipeline stalls for: Multi-cycle instructions • Cache misses • Data dependencies (2 cycles between calculating and using result) • Mispredicted branch penalty: 3 cycles

Hardware multiply Can use different options for multiplier (at the processor design stage) No

Hardware multiply Can use different options for multiplier (at the processor design stage) No h/w multiply (saves FPGA gates) ○ Speed depends on algorithm Use embedded multipliers (if FPGA has those) ○ 1 -5 cycles (depends on FPGA) Implement multipliers on FPGA gates ○ 11 cycles Division 4 -66 cycles on hardware

Compare to Tiger. SHARC No support for parallel instructions No support for SIMD operations

Compare to Tiger. SHARC No support for parallel instructions No support for SIMD operations Multicycle instructions stall the pipeline All the above limitations can be overcome by using FPGA space unoccupied by the processor itself

Comparison of NIOS II and Tiger. SHARC on an FIR Algorithm

Comparison of NIOS II and Tiger. SHARC on an FIR Algorithm

Integer FIR algorithm int coeff[]={1, 2, 3, 4, 5, 6, 7, 8}; int data

Integer FIR algorithm int coeff[]={1, 2, 3, 4, 5, 6, 7, 8}; int data 1[] = {1, 0, 0 , 0 , 0}; int output[8]; int i=0, j=0, k=0; for(k=0; k<8; k++) output[k] =0; for( j =0; j< 8; j++) { for( i= 0; i< 8; i++) { output[j] += data 1[i]*coeff[7 -i]; } }

Speed analysis movi r 4, 8 i=8 ldw r 2, 0(r 6) load data

Speed analysis movi r 4, 8 i=8 ldw r 2, 0(r 6) load data 2 ldw r 3, 0(r 7) load coefficient 3 addi r 4, -1 i-- 4 addi r 6, 4 coeff. Pt++ 5 mul r 2, r 3 data = data * coeff 6 addi r 7, -4 data. Pt-- 7 stall data stall – waiting for multiplication result 8 add r 5, r 2 output += data 9 bne r 4, zero, 0 x 10002 a 0 will mispredict 2 times in the beginning, and 1 time in the end of the loop (waste 3 cycles each time) 0 1 Loop:

Speed analysis 9 cycles per iteration except the first two (branch predicted not taken)

Speed analysis 9 cycles per iteration except the first two (branch predicted not taken) and the last (branch predicted taken) – those will be 9+3=12 cycles 1 data stall – can remove by moving instruction from line 4 to 7 Speed: 8 cycles * (N-3) + 11 cycles * 3 = 8*(N-3)+33 cycles For 1024 -tap FIR: 8201 cycles Clock cycle is 3 times longer (200 MHz vs 600 MHz)

Speed comparison 8201 NIOS II cycles equivalent to 24603 Tiger. SHARC cycles • Lab

Speed comparison 8201 NIOS II cycles equivalent to 24603 Tiger. SHARC cycles • Lab 3 timing: • – 56000 cycles Debug mode – 13000 unoptimized ASM – 4000 Optimized ASM Worse than unoptimized assembly, but no hardware acceleration used, so this is not that bad

Hardware Acceleration Profiling tool in Eclipse can show long each function takes If function

Hardware Acceleration Profiling tool in Eclipse can show long each function takes If function takes too long, it can be sped up by Custom instructions Hardware Acceleration is to take the function and transform it into FPGA circuitry

Hardware Acceleration Can be done using C 2 H compiler from Altera Trades off

Hardware Acceleration Can be done using C 2 H compiler from Altera Trades off Logic Size for Speed up. Table 1. User Application Results Example Algorithm Speed Increase (vs. Nios II CPU) System f. MAX (Mhz) System Resource Increase (1) Autocorrelation 41. 0 x 115 124% Bit Allocation 42. 3 x 110 152% Convolution Encoder 13. 3 x 95 133% Fast Fourier Transform (FFT) 15. 0 x 85 208% High Pass Filter 42. 9 x 110 181% Matrix Rotate 73. 6 x 95 106% RGB to CMYK 41. 5 x 120 84% RGB to YIQ 39. 9 x 110 158%

Conclusion “Soft” Processors such as the NIOSII offers another alternative in the embedded system

Conclusion “Soft” Processors such as the NIOSII offers another alternative in the embedded system scene. The NIOSII offers the advantage of added configurability, and customization that blur the line between FPGAs and DSPs

References [1] http: //www. fpgajournal. com/articles/behere. htm Describes an FPGA-DSP project based on Altera

References [1] http: //www. fpgajournal. com/articles/behere. htm Describes an FPGA-DSP project based on Altera Nios [2] http: //www. altera. com/products/ip/processors/nios 2/ni 2 -index. html Official Nios II page [3] http: //www. hunteng. co. uk/dsp-fpga. htm DSP or FPGA? What is better when? [4] http: //www. hunteng. co. uk/pdfs/tech/DSP 1736 FPGA. pdf Article from Xilinx about FPGA DSPs [5] http: //www. niosforum. com Community forum for NIOS [6] http: //www. altera. com/literature/hb/nios 2/n 2 cpu_nii 5 v 1. pdf NIOSII Processor Handbook –Altera Corporation [7] http: //www. altera. com/literature/manual/mnl_avalon_spec. pdf Avalon Memory-Mapped Interface Specifications – Altera Corporation [8] http: //www. analog. com/en/prod/0, 2877, ADSP%252 DTS 201 S, 00. html ADSP-TS 201 S 500/600 MHz Tiger. SHARC Processor with 24 Mbit on-chip embedded DRAM