Xtensa – A Configurable Embedded Microprocessor
Feb 2013
Jerry Redington, Principal System Architect

Market Accepted, Market Proven
Over 2 Billion Cores Worldwide
• Market segments shown: Home Entertainment, Mobile Wireless, Network Infrastructure, PC Graphics
• Product types shown: Smartphone, DTV, Blu-ray, Receiver, Base Station, Android Tablet, Auto Infotainment, Wireless, Digital Cameras, STB, Network Access, Games, UltraBooks, Printers, Storage
• Example products shown: iPhone 4, Samsung Galaxy-S, Blackberry Bold 9780, Fujitsu LTE F-01D
Copyright © 2013, Tensilica, Inc. All rights reserved.

Congratulations University of Florida
• You are part of our University access program
  – You have the ability to download our Xtensa Xplorer IDE
  – Create an unlimited number of processor cores for software (ISS), hardware (FPGA) or SystemC simulations
    • Create processors with almost all of our configuration options
    • Access to our prebuilt Diamond and ConnX DSP processors
  – Create custom interfaces and custom instructions with our TIE language (Verilog-like)
    • Create interfaces to augment data transport between the external world and Xtensa
    • Create a range of instructions that will affect computational capacity
  – Produce RTL suitable for FPGA exploration
    • Target supported FPGA platforms with a complete microprocessor
    • Create a Xilinx NGO netlist for inclusion in your FPGA SOC target

RISC Microprocessors
Have similar features, however implemented very differently
• Modern RISC/DSP architectures
  – All have instruction sets, however the instruction format varies
    • Width of instruction: 16, 24, 32, 40, …, 128 bits (VLIW)
    • Fixed versus variable length, intermixing of instruction formats, multiple format encodings
    • Single / multiple issue
    • SIMD
  – Compiler support
    • Minimum features: load/store, move, arithmetic, logical, shift, jump/branch, processor control
    • Floating point (single/double)
    • Dividers, multipliers, MAC (different format widths and sign)
    • Saturation, min/max, DSP, zero-overhead loop, and many more
  – Load / Store Architecture
    • Memory widths vary: 16, 32, 64, 128, 256, 512 bits per transaction
    • Single, dual, or more load-store units
    • Register file(s)
      – Single or multiple register files, width, depth (compiler support)
      – # of read/write ports per instruction, # of read/write ports per VLIW instruction word
      – Windowed / shadowed RF

RISC Microprocessors (continued)
Have similar features, however implemented very differently
• Modern RISC/DSP architectures
  – Memory sub-system
    • Unified or private address range
    • TCM: tightly coupled (single cycle) memory interfaces
    • Instruction / data cache: cache depth, line length, line locking, write through / write back, critical word first, line fill policies, replacement algorithms and of course exception handling
    • FIFO interfaces (handshake interface)
    • GPIO
  – Exception / Interrupt Architecture
    • Exception causes
    • Interrupt sources, priority levels, NMI, vector entry points

Why So Many Choices?
All machines have a bias
• Simply put, embedded processors are biased toward an application
• What drives microprocessor features
  – Different markets value features differently
    • Cell phones (battery and cost sensitive): value power, die area, performance
    • Desktop computers: value performance, power and die area
    • USB Flash memory sticks: value die area, power, performance
• Applications drive microprocessor features
  – Audio codecs (fixed-precision math bias)
  – Video codecs (fixed/floating point, SIMD)
  – Image processing
  – Baseband processors slanted towards wide SIMD
  – Crypto engines (bit manipulation)

Xtensa: Integrates Multiple Strengths Into A Single Microprocessor
Dataplane Processor Unit (DPU)
• 10-100x better performance than DSP/CPUs
• Better control and tools than DSPs
• More flexible than custom logic
[Figure: three overlapping strength areas that the DPU combines]
• CPU strengths: control-oriented, software development
• Custom logic strengths: task-specific, differentiating, direct point-to-point interfaces
• DSP strengths: SIMD, VLIW, stream processing

Degrees of Freedom with Xtensa
• Configuration Options
  – Pre-built features presented in a menu style
  – Memory interfaces (caches, TCM)
  – Pre-defined instructions (floating point, DSP, audio, baseband DSP)
  – Interrupt and memory map
• TIE: User Defined Interfaces
  – GPIO
  – FIFO
  – Look-up table (lightweight memory interfaces)
• TIE: User Defined Instructions
  – Single cycle
  – Multi-cycle
  – Limited by your imagination and, of course, physical rendering limitations
• Xilinx FPGA support for commercial development boards (Xilinx ML605)
  – GUI support for target boards
  – Download configurations directly into FPGA for software development
  – JTAG probes for command control of debug sessions
  – Trace logic for non-intrusive debug sessions

Xtensa – Configurability
Click-box Options Include Pre-defined Extensions
Simple menus of options
• From fine tuning of performance, power and area
  – Size, type, width and access latency of memories. Optional prefetch unit.
  – Load/Store unit characteristics
  – Number of general purpose registers
  – Number and priority levels of interrupts
• To high-level, market-specific building blocks
  – Common functional units:
    • Floating point, multiplier, divider, NSA
  – Complex application engines:
    • HiFi Audio DSP family
    • ConnX BBE16/32/64 Baseband DSP family
    • ConnX Vectra LX quad-MAC DSP
    • ConnX D2 dual-MAC DSP

Xtensa – Extensibility
Customize a DPU to Your Task
Using a simple Verilog-like language
• Inputs and outputs
• Scratchpad memories
• Simple single-cycle instructions
• Multi-cycle instructions
• SIMD for vectorization
• FLIX for parallel operations

I/O Queues: three 256-bit queues and an "add" operation (inA + inB → outC):
    queue inA 256 in
    queue inB 256 in
    queue outC 256 out
    operation ADD_XFER {} {in inA, in inB, out outC} {
        assign outC = inA + inB;
    }

Single Cycle Instruction: Byteswap (inpReg: byte3 byte2 byte1 byte0 → outReg: byte0 byte1 byte2 byte3):
    operation BYTESWAP {out AR outReg, in AR inpReg} {} {
        assign outReg = { inpReg[7:0], inpReg[15:8], inpReg[23:16], inpReg[31:24] };
    }
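As a cross-check for the BYTESWAP operation on this slide, a plain C reference model can be handy when testing a TIE instruction in the ISS. The sketch below is only an illustration: the function name byteswap_ref is hypothetical and not part of any Tensilica tool.

```c
#include <stdint.h>

/* Hypothetical C reference model of the BYTESWAP operation: reverses the
 * four bytes of a 32-bit word, mirroring the concatenation in the TIE
 * assign statement. */
uint32_t byteswap_ref(uint32_t in)
{
    return ((in & 0x000000FFu) << 24) |  /* inpReg[7:0]   -> bits 31..24 */
           ((in & 0x0000FF00u) << 8)  |  /* inpReg[15:8]  -> bits 23..16 */
           ((in & 0x00FF0000u) >> 8)  |  /* inpReg[23:16] -> bits 15..8  */
           ((in & 0xFF000000u) >> 24);   /* inpReg[31:24] -> bits 7..0   */
}
```

Applying the model twice returns the original word, which makes a convenient sanity test.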

Complete Development Tool Chain
Mature and integrated for efficient development
• Automatically adapts to options and any custom extensions
  – Use for all Xtensa DPUs
  – In single and multi-processor developments
• Comprehensive development environment
  – Xplorer IDE: Eclipse-based GUI
    • Multiple processor system creation
  – Includes industry-leading vectorizing compiler
    • Advanced optimizations with automatic speed/area optimization
  – Debugging, profiling, linking, assembling, power estimation tools
    • GNU tools supported too
• TRAX: program trace module with compression
  – Simulated or real target hardware trace

Best in Class Simulation Models
Options at Every Level of Abstraction
• Cycle-accurate, pipeline-modeled ISS: most accurate in industry
  – Included as part of the SDK
• TurboXim: fast functional simulator for software development
  – Offers mixed-mode simulation with ISS to generate statistical profiling information
  – Performance of 10-50 million simulation cycles per second
    • On typical low-cost PCs (3 GHz Intel Xeon 5160 running Linux)
• System modeling support
  – XTMP and XTSC
    • C and SystemC transaction-based models
  – Pin-level modeling
    • SystemC modeling at the pin level for RTL co-simulation
  – Supported by all major ESL vendors

Xtensa - Full Development Automation
Making DPUs Usable by All Engineers
[Figure: Xtensa Processor Generator* flow. Iterate in minutes!]
• Inputs
  1. Processor configuration: select from menu
  2. Processor extensions: explicit instruction description (TIE)
• Output: Complete Hardware Design
  – Pre-verified RTL, EDA scripts, test suite
  – Use standard ASIC/COT design techniques and libraries for any IC fabrication process
• Output: Customized Software Tools
  – C/C++ compiler, debuggers, simulators, RTOSes
* US Patent: 6,477,697

Xtensa Processor Generator
Fully Automated Hardware and Software Tools Generation
[Figure: generator flow]
• Inputs: set/choose configuration options; designer-defined instructions (optional)
• Xtensa Processor Generator outputs
  – Hardware: RTL and EDA scripts, feeding synthesis, block place & route, verification, chip integration / co-verification, then to fab / FPGA
  – System modeling / design: Instruction Set Simulator (ISS), fast functional simulator (TurboXim), XTSC SystemC system modeling, pin-level cosimulation, XTMP C-based system modeling
  – Software tools: Xplorer IDE (graphical user interface to all tools), GNU software toolkit (assembler, linker, debugger, profiler), Xtensa C/C++ (XCC) compiler, C software libraries, Xenergy estimator, operating systems
• Development streams: system development, software development
• Iteration loop: application source (C/C++) → compile → executable → profile using ISS → choose a different configuration or develop new instructions

Complete Development Tool Chain
Xplorer: Single IDE for All Development Stages
The whole development flow in one integrated tool
[Figure: Xplorer at the center of the flow]
• Activities: Edit (C, C++, ASM), Compile + Link, Partition/LSP, Simulate, Co-sim, Debug + Trace, Profile
• Inputs and models: C libraries, system models
• Targets: DPU, ISS, hardware (silicon, FPGA)

Inside Xtensa

Xtensa LX4 Block Diagram - System
[Block diagram. Blocks shown:]
• Processor controls: exception support, exception registers, trace port, JTAG TAP control, on-chip debug, data address watch registers, instruction address watch registers, timers, interrupt control
• Instruction side: instruction fetch / decode; instruction RAM x 2, instruction ROM, instruction cache; instruction memory management, protection & error recovery
• Execution: base ISA execution pipeline (base register file, base ALU), optional functional units, VLIW (FLIX) parallel execution pipelines, designer-defined register files, processor state and functional units
• Data side: data load/store unit, designer-defined dual load/store unit; data RAM x 2, data ROM, data cache; XLMI local memory interface; data memory management, protection & error recovery
• External interface: processor interface control, prefetch, write buffer, PIF bridge, AHB-Lite/AXI bus bridge to the system bus (RAM, DMA, devices)
• Designer-defined queues, ports & lookups (GPIO32, QIF32) connecting to RTL, FIFO, memory, Xtensa
KEY: Base ISA Feature · Configurable Function · Optional Function · Optional & Configurable Function · Designer-Defined Features (TIE) · External RTL & Peripherals

Xtensa LX4 Block Diagram – Optional Functional Units
[Same block diagram as the previous slide, with the optional functional units highlighted.]
Choose pre-verified functionality. Click-box options and side-by-side profiling allow easy "what-if" assessments.
• MAC16 DSP
• MUL16/32
• Integer Divide
• Single Precision Floating Point (FP)
• Double Precision FP Acceleration
• 32-bit GPIO pair (GPIO32)
• 32-bit Queue Interface pair (QIF32)
• FLIX3 (3-issue FLIX configuration)
• HiFi 2, HiFi EP or HiFi 3 Audio Engine
• ConnX D2 DSP Engine
• ConnX Vectra LX DSP Engine (1, 2 Load/Stores)
• VectraVMB (DSP Communications Acceleration Instructions)
• ConnX BBE16 / BBE32uE / BBE64 (Baseband DSP)

Xtensa LX4 Block Diagram – Customization
[Same block diagram as the previous slides, with the designer-defined (TIE) features highlighted.]
Customization
• Multi-issue FLIX (automatically used by the C compiler)
• SIMD instructions
• Compound and fusion instructions
• Multi-cycle execution units
• Registers / register files with automatic C data type support
• GPIO and Queue interfaces
• Wide (128-bit) load/store instructions

Data Transport

More flexible memory system
• A total of 6 "ways" are now supported (previously 4)
  – 4-way cache AND local memories now supported
• More combinations of different memories, a total of 6 from:
  – Instruction interface: (0-4 cache ways) + (0-2 RAMs) + (0-1 ROMs)
  – Data interface: (0-4 cache ways) + (0-2 RAMs) + (0-1 ROMs) + (0-1 XLMI)
[Figure: Xtensa with instruction-side caches ($, 0-4), RAMs (0-2) and ROM, and data-side caches ($, 0-4), RAMs (0-2), ROM and XLMI (0-1)]
• Benefits
  – 4 cache ways with locking AND prefetch extend this simple programming model approach into many more designs
  – Add local memories and have other bus masters write directly to them via inbound PIF in more complex and predictable systems
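The combination rule on this slide can be expressed as a small check. The sketch below is an assumption-laden illustration, not Tensilica's configuration logic: it reads "a total of 6" as an upper bound on the number of memories per interface, and the function name mem_config_valid is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical validity check for one memory interface.
 * Limits taken from the slide: 0-4 cache ways, 0-2 RAMs, 0-1 ROMs,
 * 0-1 XLMI (data interface only), at most 6 memories in total
 * (the "at most" reading is an assumption). */
bool mem_config_valid(int cache_ways, int rams, int roms, int xlmi,
                      bool is_data_interface)
{
    int max_xlmi = is_data_interface ? 1 : 0;  /* XLMI exists only on the data side */

    if (cache_ways < 0 || cache_ways > 4) return false;
    if (rams < 0 || rams > 2)             return false;
    if (roms < 0 || roms > 1)             return false;
    if (xlmi < 0 || xlmi > max_xlmi)      return false;

    return cache_ways + rams + roms + xlmi <= 6;
}
```

For example, 4 cache ways plus 2 RAMs is accepted, while adding a ROM on top of that exceeds the total of 6.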

Conventional Processors
• Bus-based connectivity
[Figure: RTL blocks (data, FSM, buffer) connected to the processor only through the system bus]

Xtensa Processors
• Connect via the System Bus in the same way, or…
• With multiple higher bandwidth, point-to-point interfaces
[Figure: Xtensa processor with local memory, surrounded by its interfaces]
• System bus, plus a slave interface to/from local memory
• >1000 read ports (GPIO), each >1 Kb
• >1000 write ports (GPIO), each >1 Kb
• >1000 read queues (FIFO), each >1 Kb
• >1000 write queues (FIFO), each >1 Kb
• >1000 special memory interfaces (scratch memory, table lookup memory), each >1 Kb

Multiple ports (GPIO)
E.g. system status and RTL control/setup
• TIE Ports are GPIO interfaces
  – Over 1000 ports can be specified
  – Each port can be up to 1024 bits wide
• Dedicated instructions
  – Operating in parallel with the processor's Load/Store unit
[Figure: Xtensa connected to several RTL blocks through over 1000 interfaces, each up to 1024 bits wide, alongside the system bus]

Queue Interfaces
Expand the functionality of an existing RTL design
• Conventional processors/DSPs pass data over the system bus
  [Figure: Data FSM → Buffer → system bus → DSP data processing → system bus → Data FSM]
  – RTL is often written instead, to avoid system and bus limitations
• Xtensa can pass data directly, freeing up the system bus
  [Figure: Data FSM → Xtensa data processing → Data FSM over queues, with the buffer and system bus bypassed]
  – Up to 1024 bits wide, >1000 interfaces
  – The 570T Diamond Processor has one 32-bit input queue and one 32-bit output queue

Dedicated Special Memory Interfaces
Use a special memory interface for tables and coefficients
• Simple memory interface, not part of the memory map
  – Index up to 4 G items
  – Each item up to ~1000 bits wide
• Dedicated instructions
  – Operating in parallel with the processor's Load/Store unit
  – User-defined number of access cycles
  – Read/write multiple interfaces at once with VLIW
[Figure: Xtensa with a wide read/write lookup interface (4 G locations, ~1000 data bits, access time ∆t) to a coefficient / mapping table / scratch memory or an RTL block with dynamic response, separate from the system bus]
Uses: filter coefficient storage, mapping tables, scratch memory, custom operations.

Instruction Designer

Instruction Format
• Base instruction set is 24-bit instructions
    ADD ar, as, at          AR[r] ← AR[s] + AR[t]
    bits 23..0:  10000000 | r | s | t | 0000     (field widths 8, 4, 4, 4, 4)
• "Density" option adds 16-bit instructions
    ADD.N ar, as, at        AR[r] ← AR[s] + AR[t]
    bits 15..0:  r | s | t | 1010                (field widths 4, 4, 4, 4)
  – In assembler, density instructions are signified by the ".N" suffix.
  – The C/C++ compiler infers 16-bit instructions automatically.
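To make the two field layouts concrete, the sketch below packs the register numbers into the bit positions drawn on this slide. It is an illustration only: the function names are hypothetical, and byte ordering in instruction memory is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* 24-bit ADD, following the slide's layout:
 * [23:16] = 10000000, [15:12] = r, [11:8] = s, [7:4] = t, [3:0] = 0000 */
uint32_t encode_add(unsigned r, unsigned s, unsigned t)
{
    assert(r < 16 && s < 16 && t < 16);  /* each register field is 4 bits */
    return (0x80u << 16) | (r << 12) | (s << 8) | (t << 4) | 0x0u;
}

/* 16-bit ADD.N (density option), following the slide's layout:
 * [15:12] = r, [11:8] = s, [7:4] = t, [3:0] = 1010 */
uint16_t encode_add_n(unsigned r, unsigned s, unsigned t)
{
    assert(r < 16 && s < 16 && t < 16);
    return (uint16_t)((r << 12) | (s << 8) | (t << 4) | 0xAu);
}
```

With these layouts, "add a3, a5, a2" packs to 0x803520 and "add.n a3, a5, a2" to 0x352A, which shows the one-byte saving of the density form.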

FLIX – Flexible Length Xtensions
• Create a multi-issue VLIW-style processor to boost processor performance
  – FLIX instructions can be 32, 64 or 128 bits wide (choose one)
  – Modeless intermixing of 16-bit, 24-bit, and wide instructions
    • Eliminates VLIW-style code bloat
• Designer-defined formats, # of slots in each format, operations in each slot
  – Any combination of most base ISA and TIE operations in each slot
• Compiler automatically generates instruction bundles from standard C code to improve performance
Designer-defined FLIX instruction formats with designer-defined number of operations:
    Example 3-operation, 64-bit instruction format (bits 63..0):
        Operation 1 | Operation 2 | Operation 3 | 1 1 1 0
    Example 5-operation, 64-bit instruction format (bits 63..0):
        Operation 1 | Operation 2 | Op 3 | Op 4 | Operation 5 | 1 1 1 0

Xtensa Instruction Pipeline
• Instructions are executed in a RISC pipeline
  – This is the minimal, 5-stage pipeline
  – Instructions generally spend 1 clock cycle in each stage
  – Pipeline stages of multiple instructions are overlapped in the pipeline
1. Instruction Fetch: instruction memory read
2. Register Read: instruction decode, and register operand read
3. Execute: ALU operation, or effective address calculation for load/store
4. Memory Access: read of local memory or cache
5. Writeback: register or memory write (instruction committed)
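Because the stages of consecutive instructions overlap, an ideal pipeline retires one instruction per cycle once it is full. The sketch below is a back-of-the-envelope model under the assumption of no stalls, bubbles or branches; the function name pipeline_cycles is hypothetical.

```c
/* Ideal cycle count for n_instructions flowing through an n_stages-deep
 * pipeline with full overlap and no stalls (an idealization): the first
 * instruction needs n_stages cycles, each later one adds a single cycle. */
unsigned pipeline_cycles(unsigned n_instructions, unsigned n_stages)
{
    if (n_instructions == 0)
        return 0;  /* nothing to execute */
    return n_stages + (n_instructions - 1);
}
```

For the 5-stage pipeline above, one instruction takes 5 cycles and three back-to-back instructions take 7, rather than 15 without overlap.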

Notation: Pipeline Diagrams
[Figure: pipeline sequence diagram, one column per stage]
• (P) (Prefetch): PC; send address to instruction memories
• I, Instruction Fetch: read instruction memory and align instructions
• R, Register Read: decode instruction and Reg File access (ar, as, at)
• E, Execute: ALU computation, or load/store address calculation
• M, Memory Access: local memory / cache access for loads; stage ALU result
• W, Writeback: Reg File update; write result to AR Reg File (commit)
– This example is for a 5-stage pipeline
– This is a sequence diagram, not a block diagram!
  • "Reg File Access" (read) in R-stage and "Reg File Update" (write) in W-stage refer to different operations on the same (AR) register file
– Prior to I-stage, the program counter stage (P-stage) is sometimes shown
  • P-stage is almost always overlapped with other stages, so it is not generally illustrated.

Xtensa 5-Stage Pipeline (Instruction Execution)
    6000117f: ...
    60001181: add.n a3, a5, a2
    60001183: ...
• (P): send address to Inst Mems
• I: read Inst Memory and align instructions
• R: decode instruction and access Reg File (read a5, a2)
• E: computation: a2 + a5
• M: stage result (cycle reserved for data memory access for loads)
• W: write result to a3 in the Reg File

Example 32-bit Load Instruction
    6000117f: ...
    60001181: l32i.n a3, a5, 0
    60001183: ...
• (P): send address to Inst Mems
• I: read Inst Memory and align instructions
• R: decode instruction and access Reg File (read a5; immediate 0)
• E: address generation: a5 + 0
• M: local memory read or cache access
• W: write result to a3 in the Reg File

Example 32-bit Store Instruction
    6000117f: ...
    60001181: s32i.n a3, a5, 0
    60001183: ...
• (P): send address to Inst Mems
• I: read Inst Memory
• R: decode instruction and access Reg File (read a5; immediate 0)
• E: address generation: a5 + 0; read a3
• M: stage address and data
• W: local memory write

Instruction Design Decisions
• Compile-time operands
  – The instruction word limits the number and width of operands passed to an instruction
  – Fixed at compile time
  – Visible to the programmer
• Dynamic operands
  – Operands in the form of index(es) into a register file (the compiler schedules these resources)
  – Single/multiple register file
  – Ctypes
  – Visible to the programmer
• Intrinsic operands
  – Are usually in the form of a special purpose register, like an accumulator
  – Instruction decoder understands how to enable the use of these registers
  – Invisible to the programmer
• Single-cycle instructions
  – Integer ADD, AND
• Multi-cycle instructions (resource schedule parameters)
  – Load/store
  – MAC

High Performance Techniques
• Application-specific instructions
  – SAD, CRC, AES, DES
• Fusion
  – Merging serial operations into a fused operation
  – Load/store merged with pointer math
• SIMD – Single Instruction Multiple Data
  – Perform the same operation across multiple elements of a vector word
• VLIW – Very Long Instruction Word
  – Multiple operations in a single instruction word
  – All operations execute in the same clock cycle

Performance Techniques: Fusion
Original C Code:
    for(i=0; i<SIZE; i++){
        sum += (A[i]*B[i]) << 2;
    }
Compiled Assembly (multiply in cycle 1, shift in cycle 2):
    …
    mul  a13, a10, a8;
    slli a12, a13, 2;
    …
Compiled Assembly with a Fusion operation (merging mul and slli):
    …
    mulshift a12, a10, a8;
    …
Fusion – Merging sequential operations into a single operation
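The fused operation can be checked against a C model that computes multiply-then-shift in one step. The sketch below assumes 32-bit unsigned operands with wraparound (the slide does not state the element types); mulshift_ref and sum_mulshift are hypothetical names, not Tensilica intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the fused "mulshift": one call replaces
 * the mul + slli pair. 32-bit wraparound arithmetic is an assumption. */
uint32_t mulshift_ref(uint32_t a, uint32_t b)
{
    return (a * b) << 2;
}

/* The slide's loop, written against the fused model. */
uint32_t sum_mulshift(const uint32_t *A, const uint32_t *B, size_t size)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += mulshift_ref(A[i], B[i]);  /* one fused operation per element */
    return sum;
}
```

On a real configuration the compiler would see the TIE-generated intrinsic instead of this model; the model is only useful for comparing results.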

Performance Techniques: SIMD
Original C Code:
    for(i=0; i<SIZE; i++)
        sum[i] = A[i] + B[i];
[Figure: a typical processor performs one add per iteration (iteration 0, iteration 1, …); an Xtensa processor with a SIMD operation adds 4 elements of A[] and B[] into sum[] at once]
SIMD – Single operation on multiple data

Performance Techniques: VLIW
Original C Code:
    for (i=0; i<n; i++)
        c[i] = (a[i]+b[i])>>2;
Compiled Assembly (diagram marker: cycle 8):
    loop:
        …
        addi a9, 4;
        addi a11, 4;
        l32i a8, a9, 0;
        l32i a10, a11, 0;
        add  a12, a10, a8;
        srai a12, 2;
        addi a13, 4;
        s32i a12, a13, 0;
        …
Compiled Assembly with a 64-bit FLIX, bundling 3 operations in each 64-bit FLIX instruction (diagram marker: cycle 3):
    loop:
        { addi ; add  ; l32i }
        { addi ; srai ; l32i }
        { addi ; nop  ; s32i }
FLIX – Bundling multiple operations in a single instruction word

A Simple Example
mytiefile.tie:
    operation ADD_BYTES {out AR sum, in AR fourbytes} {}
    {
        assign sum = fourbytes[7:0] + fourbytes[15:8] + fourbytes[23:16] + fourbytes[31:24];
    }
Behavioral Description
• The combinational logic between operands
  – In this example, the logic is between two registers of the AR register file
  – By default, the operation executes in a single cycle
• Syntax is similar to Verilog
• The logic is described in expressions that begin with assign or wire
  – assign: assignment to any "out" or "inout" operand
  – wire: instantiates a local variable that can only be assigned once (more about wires later)
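For comparison, this is roughly what ADD_BYTES replaces in plain C. The sketch assumes the four byte values are summed at the full 32-bit width of the destination register, so the result is not truncated to 8 bits; add_bytes_ref is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical C reference model of ADD_BYTES: sums the four bytes of a
 * 32-bit word. Assumes the sum is kept at 32 bits (maximum 4 * 255 = 1020). */
uint32_t add_bytes_ref(uint32_t fourbytes)
{
    return (fourbytes & 0xFFu)           /* fourbytes[7:0]   */
         + ((fourbytes >> 8) & 0xFFu)    /* fourbytes[15:8]  */
         + ((fourbytes >> 16) & 0xFFu)   /* fourbytes[23:16] */
         + (fourbytes >> 24);            /* fourbytes[31:24] */
}
```

The C version needs several shift, mask and add instructions on a base processor, whereas the TIE operation does the same work as one single-cycle instruction.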

Using TIE State in an Instruction
mac.tie:
    operation MAC24 {in AR m0, in AR m1} {inout ACCUM}
    {
        assign ACCUM = ACCUM + m0[23:0] * m1[23:0];
    }
• A TIE state operand is listed in the second set of "{ }" in the operation definition
• A TIE state is an implicit operand in the sense that it does not appear in the assembly syntax or C intrinsic of the instruction
mac.c:
    unsigned x, y;
    MAC24(x, y);   // ACCUM += x*y (24-bit multiply)
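A C model makes the implicit state visible by passing it in and out explicitly. The sketch below is an illustration under assumptions: the slide does not show the declaration of ACCUM, so a 64-bit unsigned accumulator is assumed, and mac24_ref is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical reference model of MAC24. The TIE state ACCUM is implicit in
 * the instruction; here it is an explicit argument and return value.
 * The 64-bit accumulator width is an assumption, not taken from the slide. */
uint64_t mac24_ref(uint64_t accum, uint32_t m0, uint32_t m1)
{
    uint64_t a = m0 & 0xFFFFFFu;  /* m0[23:0]: upper 8 bits are ignored */
    uint64_t b = m1 & 0xFFFFFFu;  /* m1[23:0] */
    return accum + a * b;         /* 24 x 24 -> at most 48-bit product */
}
```

Chaining calls, as in mac24_ref(mac24_ref(0, x0, y0), x1, y1), mirrors two consecutive MAC24 instructions that share the same state.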

SIMD Example: 4-Way Add Operation
vec4_add16.tie:
    regfile simd64 64 16 v   // 16 x 64-bit wide registers
    operation vec4_add16 {out simd64 sum, in simd64 A, in simd64 B} {}
    {
        wire [15:0] result0 = (A[15:0]  + B[15:0]);
        wire [15:0] result1 = (A[31:16] + B[31:16]);
        wire [15:0] result2 = (A[47:32] + B[47:32]);
        wire [15:0] result3 = (A[63:48] + B[63:48]);
        assign sum = {result3, result2, result1, result0};
    }
• The new register file operands are explicit operands of the operation
• Similar to using the AR register file as inputs/output in previous examples
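The lane behavior of vec4_add16 can be modeled in C on a 64-bit integer. This is a sketch for checking results, with vec4_add16_ref as a hypothetical name; each lane wraps at 16 bits because the TIE wires are declared [15:0].

```c
#include <stdint.h>

/* Hypothetical reference model of vec4_add16: four independent 16-bit adds
 * packed in a 64-bit word. A carry out of one lane never reaches the next,
 * matching the 16-bit result wires in the TIE description. */
uint64_t vec4_add16_ref(uint64_t A, uint64_t B)
{
    uint64_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 16 * lane;
        uint16_t a = (uint16_t)(A >> shift);
        uint16_t b = (uint16_t)(B >> shift);
        uint16_t r = (uint16_t)(a + b);      /* truncate to the lane width */
        sum |= (uint64_t)r << shift;
    }
    return sum;
}
```

Note the difference from an ordinary 64-bit add: a lane holding 0xFFFF plus 1 becomes 0x0000 without disturbing its neighbor.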

SIMD Example: 4-Way Add Example (2)
Now let's use our register files from C code:
    simd64 A[VECLEN];
    simd64 B[VECLEN];
    simd64 sum[VECLEN];
    for (i=0; i<VECLEN; i++){
        sum[i] = vec4_add16(A[i], B[i]);
    }
• The register file's name (simd64) is used as a new data type in C/C++. Variables of this type will be mapped by the C compiler to registers from the simd64 register file.
Note: You may define one or more data types for a given register file using the "ctype" construct.

Operator Overloading
• Enables use of standard C language operators such as "+" with user-defined data types.
• Simpler, more portable "native C" programming model as opposed to using intrinsics.
• The C compiler can infer an operation based on the data types of the operator arguments.
    simd64 a, b, c;
    c = vec4_add16(a, b);   // using intrinsics
    c = a + b;              // using operator overloading

Scheduling TIE Operations
• TIE compiler assumes a single-cycle schedule
  – Input registers are used at the beginning of the (E)xecute stage
  – Output registers are defined at the end of the (E)xecute stage
• Use schedule to define multi-cycle operations
  – Read inputs in use stages
  – Write outputs, states and wires in def stages
  – Use symbolic pipeline stage names
    operation MACC {inout MRF acc, in MRF mul1, in MRF mul2} {}
    {
        assign acc = TIEmac(mul1[23:0], mul2[23:0], acc, 1'b1, 1'b0);
    }
    schedule macc_sched {MACC}
    {
        // Read operands at start of Estage (stage 1)
        use mul1 Estage; use mul2 Estage; use acc Estage;
        // Write results at end of Estage+1 (stage 2)
        def acc Estage+1;
    }

Back-to-Back MACC Pipeline Diagram with Data Dependency
    …
    macc my5, my1, my2
    macc my5, my3, my4
    …
[Figure: pipeline timing of the two MACC instructions]
• Cycle 0: first MACC in Estage (reads my1, my2, my5)
• Cycle 1: first MACC in Estage+1 (writes my5); second MACC held by a bubble
• Cycle 2: second MACC in Estage (reads my3, my4, my5), followed by its Estage+1 (writes my5)
If a data dependency exists in the source code, the processor inserts execution bubbles (delay cycles) until input operands are available.

Two Cycle Operations using schedule
• Two-cycle MACC
  – Input registers are used at the beginning of the E stage
  – Output registers are defined at the end of the E+1 stage
• The data path for this 2-cycle operation is spread across the E and E+1 stages
• This simple schedule does not explicitly partition the hardware between the two pipelined stages. (We need to use "retiming" in the synthesis flow.)
[Figure: decoder and control in the R stage, MRF register file, source routing, ALU and MACC spanning the E and M stages, result routing]
See the TIE Reference Manual for more details

Improved MACC Operation Schedule
• Do not need to use acc until Estage+1
    operation MACC {inout MRF acc, in MRF mul1, in MRF mul2} {}
    {
        assign acc = TIEmac(mul1, mul2, acc, 1'd0);
    }
    schedule macc_sched {MACC}
    {
        use mul1 Estage;       // read at start of Estage (stage 1)
        use mul2 Estage;
        use acc Estage + 1;    // read at start of Estage+1 (stage 2)
        def acc Estage + 1;    // write at end of Estage+1 (stage 2)
    }
[Figure: pipe stage E holds the MACC partial logic fed by mul1 and mul2; pipe stage E+1 holds the MACC partial logic that takes acc]

Back-to-Back MACC Pipeline Diagram – Improved Scheduling
    …
    macc my5, my1, my2
    macc my5, my3, my4
    …
[Figure: pipeline timing of the two MACC instructions]
• Cycle 0: first MACC in Estage (reads my1, my2)
• Cycle 1: first MACC in Estage+1 (reads and writes my5); second MACC in Estage (reads my3, my4)
• Cycle 2: second MACC in Estage+1 (reads the bypassed my5, writes my5)
"use acc Estage+1" allows bypass for data-dependent MACCs.
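The effect of moving "use acc" from Estage to Estage+1 can be estimated with a small cycle model of the two back-to-back MACC diagrams. This is a simplified sketch, not the scheduler's actual algorithm: it counts only the E and E+1 cycles of a chain of MACCs that each depend on the previous accumulator, and macc_chain_cycles is a hypothetical name.

```c
#include <assert.h>

/* Cycles (counting Estage and Estage+1 only) for n dependent MACCs.
 * acc_use_offset: stage, relative to Estage, in which acc is read
 *   0 = "use acc Estage", 1 = "use acc Estage + 1".
 * acc is always defined at the end of Estage+1 (offset 1), as on the slides. */
unsigned macc_chain_cycles(unsigned n, unsigned acc_use_offset)
{
    const unsigned def_offset = 1;               /* def acc Estage+1 */
    assert(acc_use_offset <= def_offset);
    if (n == 0)
        return 0;
    /* A dependent MACC may enter Estage this many cycles after its producer. */
    unsigned spacing = def_offset - acc_use_offset + 1;
    return (n - 1) * spacing + def_offset + 1;   /* plus E and E+1 of the last one */
}
```

For the two MACCs in the diagrams this gives 4 cycles with "use acc Estage" (one bubble) and 3 cycles with "use acc Estage + 1"; for long accumulation loops the improved schedule approaches one MACC per cycle.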

Methods of Reducing TIE Area
• Two multiply operations (each with its own pair of multipliers; VECMAC16 also has adders)
• How do we share the multipliers?
• Design with shared functions and semantics.
    regfile SR 64 4 s
    operation VECMUL16 {out SR srr, in SR srs, in SR srt} {}
    {
        wire [31:0] mtmp1 = srs[15:0]  * srt[15:0];
        wire [31:0] mtmp2 = srs[47:32] * srt[47:32];
        assign srr = {mtmp2, mtmp1};
    }
    operation VECMAC16 {inout SR srr, in SR srs, in SR srt} {}
    {
        wire [31:0] mtmp1 = srs[15:0]  * srt[15:0];
        wire [31:0] mtmp2 = srs[47:32] * srt[47:32];
        assign srr = { srr[63:32] + mtmp2, srr[31:0] + mtmp1 };
    }

Nested Function Example
Myfunction1.tie:
    operation ADD8x4 {out AR sum, in AR in0, in AR in1}{}{
        assign sum = as8x4(in0, in1, 1'b1);
    }
    operation SUB8x4 {out AR diff, in AR in0, in AR in1}{}{
        assign diff = as8x4(in0, in1, 1'b0);
    }
    function [31:0] as8x4 ([31:0] a, [31:0] b, add) {
        wire [7:0] t0 = addsub(a[7:0],   b[7:0],   add);
        wire [7:0] t1 = addsub(a[15:8],  b[15:8],  add);
        wire [7:0] t2 = addsub(a[23:16], b[23:16], add);
        wire [7:0] t3 = addsub(a[31:24], b[31:24], add);
        assign as8x4 = {t3, t2, t1, t0};
    }
    function [7:0] addsub ([7:0] a, [7:0] b, add) { ... }
Hardware:
• The two operations each call as8x4, giving two separate copies of as8x4
• Each as8x4 function has 4 copies of the addsub function
• 8 addsub modules are instanced in HW
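A C model of the nested functions helps verify ADD8x4 and SUB8x4. The body of addsub is elided on the slide, so the version below is a guess made for illustration only: it assumes addsub returns a + b when add is 1 and a - b otherwise, wrapping at 8 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed behavior of addsub (its body is not shown on the slide):
 * add selects a + b, otherwise a - b, truncated to 8 bits. */
uint8_t addsub(uint8_t a, uint8_t b, bool add)
{
    return (uint8_t)(add ? a + b : a - b);
}

/* Reference model of as8x4: four independent 8-bit lanes, each using addsub. */
uint32_t as8x4(uint32_t a, uint32_t b, bool add)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        uint8_t t = addsub((uint8_t)(a >> shift), (uint8_t)(b >> shift), add);
        result |= (uint32_t)t << shift;  /* {t3, t2, t1, t0} */
    }
    return result;
}
```

In C the loop reuses one addsub function; in hardware each call site becomes its own instance, which is where the count of 8 addsub modules comes from.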

Shared Function
• Definition
  – A single copy of hardware shared by all TIE operations
  – Add the "shared" keyword to the function description
• Benefits
  – Reduces area
  – Enables iterative operations (discussed later)
• Limitations
  – A shared function should be kept simple, as it cannot be scheduled across more than one clock cycle
  – A shared function cannot be nested
    operation ADD8x4 {out AR sum, in AR in0, in AR in1}{}{
        assign sum = as8x4(in0, in1, 1'b1);
    }
    operation SUB8x4 {out AR diff, in AR in0, in AR in1}{}{
        assign diff = as8x4(in0, in1, 1'b0);
    }
    function [31:0] as8x4 ([31:0] a, [31:0] b, add) shared { ... }
Hardware: the operations share one hardware instance of as8x4

Sharing Hardware among Operations: semantic
    regfile SR 64 4 s
    operation VECMUL16 {out SR srr, in SR srs, in SR srt} {}
    {
        wire [31:0] mtmp1 = srs[15:0]  * srt[15:0];
        wire [31:0] mtmp2 = srs[47:32] * srt[47:32];
        assign srr = {mtmp2, mtmp1};
    }
    operation VECMAC16 {inout SR srr, in SR srs, in SR srt} {}
    {
        wire [31:0] mtmp1 = srs[15:0]  * srt[15:0];
        wire [31:0] mtmp2 = srs[47:32] * srt[47:32];
        assign srr = { srr[63:32] + mtmp2, srr[31:0] + mtmp1 };
    }
    semantic arith {VECMUL16, VECMAC16}
    {
        // Operation name used as qualifier
        wire [31:0] atmp1 = VECMAC16 ? srr[31:0]  : 0;
        wire [31:0] atmp2 = VECMAC16 ? srr[63:32] : 0;
        wire [31:0] mtmp1 = TIEmac(srs[15:0],  srt[15:0],  atmp1, 1'b0);
        wire [31:0] mtmp2 = TIEmac(srs[47:32], srt[47:32], atmp2, 1'b0);
        assign srr = {mtmp2, mtmp1};
    }
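The idea behind the shared semantic, one multiply-accumulate datapath whose addend is forced to zero for the plain multiply, can be modeled in C. The sketch assumes unsigned 16 x 16 multiplies and models TIEmac simply as addend + a * b; arith_ref is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference model of semantic "arith": both VECMUL16 and
 * VECMAC16 go through the same two multiply-add units. For VECMUL16 the
 * addend is 0; for VECMAC16 it is the old value of srr.
 * Unsigned multiplication is an assumption. */
uint64_t arith_ref(bool is_vecmac16, uint64_t srr, uint64_t srs, uint64_t srt)
{
    uint32_t atmp1 = is_vecmac16 ? (uint32_t)srr : 0;          /* srr[31:0]  */
    uint32_t atmp2 = is_vecmac16 ? (uint32_t)(srr >> 32) : 0;  /* srr[63:32] */

    /* srs[15:0] * srt[15:0] and srs[47:32] * srt[47:32], plus the addend */
    uint32_t mtmp1 = atmp1 + (uint32_t)(uint16_t)srs * (uint32_t)(uint16_t)srt;
    uint32_t mtmp2 = atmp2 + (uint32_t)(uint16_t)(srs >> 32)
                           * (uint32_t)(uint16_t)(srt >> 32);

    return ((uint64_t)mtmp2 << 32) | mtmp1;                    /* {mtmp2, mtmp1} */
}
```

Calling the model with is_vecmac16 false reproduces VECMUL16 and with true reproduces VECMAC16, which is the property that lets the two operations share two multipliers instead of using four.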

TIE Language Reference: format
• Format: format name width {slot_name0, slot_name1, …}
  – name: name of the format
  – width: wide instruction word width (32, 64 or 128 bits)
  – slot_name list: list of slots and their names (at most 15 slots)
    • TIE compiler computes the width of each slot
• Example:
    format myflix2 64 {slot_a, slot_b, slot_c}
    64-bit long instruction: slot_a | slot_b | slot_c

FLIX Example
myflix.tie:
    format myflix1 64 {slot_a, slot_b, slot_c}
    slot_opcodes slot_a {L32I, S32I}
    slot_opcodes slot_b {ADDI}
    slot_opcodes slot_c {ADD, SRAI}
Resulting loop (columns are slot_a ; slot_b ; slot_c):
    loop:
        { l32i a8, a9, 0   ; addi a9, 4  ; add a12, a10, a8 }
        { l32i a10, a11, 0 ; addi a11, 4 ; srai a12, 2 }
        { s32i a12, a13, 0 ; addi a13, 4 ; nop }
• The TIE compiler will create FLIX instructions (bundles of operations) for all possible combinations of slot opcodes (including NOP).
• The C compiler will automatically infer FLIX instructions from C code to improve performance. No assembly programming required!

Multiple FLIX Formats
myflix.tie:
    format myflix1 64 {slot_a, slot_b, slot_c}
    format myflix2 64 {slot_a, slot_d}
    slot_opcodes slot_a {L32I, S32I}
    slot_opcodes slot_b {ADDI}
    slot_opcodes slot_c {ADD, SRAI}
    slot_opcodes slot_d {bigtie}
Resulting loop:
    loop:
        { l32i a8, a9, 0   ; addi a9, 4 ; add a12, a10, a8 }
        { l32i a10, a11, 0 ; bigtie a3, m9, m12, 64 }
• Multiple formats can be used to optimize utilization of instruction bits. A format with fewer slots can support operations that require many operands.