
VLIW Abstract Machine
[Block diagram. Blocks: Instruction Memory, Fetch, Register File, Execute Unit(s), Mem Unit, RF Write Unit, Data Memory. Ports: Read Request Port, Read Result Port, In Port, Out Port, Write Port. Packet paths: Ipacket_vec, Epacket_vec, Mpacket_vec, Rpacket_vec, Wpacket_vec. Memory signals: address, mem read data, mem write data.]
10/23/2021

VLIW Abstract Machine Details
• Expand all packet paths to contain EUMAX packets.
  ⇒ Each packet path becomes a vector of packets.
  ⇒ EUMAX is currently 4 in the abmtypes.vhd file.
  ⇒ The machine can execute EUMAX instructions per clock.
• Each packet_vec can contain only one load or store operation because there is only one path to data memory.
  ⇒ Many loads or stores will inhibit instruction-level parallelism.
• All forwarding paths are operational.
  ⇒ There CANNOT be any dependencies within the same packet_vec!

Forwarding
• Recall that forwarding in the pipelined AbM implementation was done within the execute module.
  ⇒ Two execute operands; three paths had to be checked for each operand (Mpacket, Rpacket, Wpacket).
• Forwarding within the VLIW AbM
  ⇒ Still need to check three paths for each operand.
  ⇒ Each path can now contain EUMAX instructions!
  ⇒ Forwarding checks = 3 paths * EUMAX * 2 operands
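The count on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and it assumes the slide's formula is a per-instruction count, so the per-packet figure (multiplying again by EUMAX execute units) is an assumption rather than something the slide states.

```python
def forwarding_checks(eumax, paths=3, operands=2):
    """Comparisons needed to resolve the operands of ONE instruction.

    paths    : forwarding sources (Mpacket, Rpacket, Wpacket)
    eumax    : packets carried on each path
    operands : source operands per instruction
    """
    return paths * eumax * operands


EUMAX = 4  # value quoted for abmtypes.vhd

pipelined = forwarding_checks(1)          # one packet per path
vliw_per_instr = forwarding_checks(EUMAX)
# Assumption: every one of the EUMAX execute units repeats the same checks.
vliw_per_packet = vliw_per_instr * EUMAX

print(pipelined, vliw_per_instr, vliw_per_packet)  # -> 6 24 96
```

Under these assumptions the comparison logic grows linearly with EUMAX per instruction and quadratically per packet_vec.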

VLIW Programming Assignment (due 9/16)
Write code that performs a [1 x 4] X [4 x 4] matrix multiplication. The [4 x 4] matrix is

  T00 T01 T02 T03
  T10 T11 T12 T13
  T20 T21 T22 T23
  T30 T31 T32 T33

The [1 x 4] matrix has values [x y z w]. The result of the multiplication is a [1 x 4] matrix [x’ y’ z’ w’] where:

  x’ = x*T00 + y*T10 + z*T20 + w*T30
  y’ = x*T01 + y*T11 + z*T21 + w*T31
  etc.

The T values are contained in registers R1-R16, stored in row-major order (R1 = T00, R2 = T01, R3 = T02, R4 = T03, R5 = T10, …). Assume that your code will be called as a subroutine, with R17 containing the base address of the [1 x 4] input matrix (Mem[R17+0] = x, Mem[R17+1] = y, etc.). The [1 x 4] result matrix should be stored beginning at the memory location pointed to by R18 (Mem[R18+0] = x’, Mem[R18+1] = y’, etc.). Your code must leave registers R1-R16 intact; all other register contents can be destroyed (the VLIW AbM has 32 registers total).
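A small software model of the required computation can help when choosing test values and checking simulation results. The sketch below is not part of the assignment files: the function name, the dictionary representation of registers and memory, and the test addresses 0 and 8 are all assumptions made for illustration. It treats R17 as the input base address, as the assignment text states.

```python
def vec_mat_4x4(regs, mem):
    """Reference model: [x y z w] times the 4x4 matrix held in R1-R16."""
    # R1..R16 hold T in row-major order: R1 = T00, R2 = T01, ..., R5 = T10, ...
    t = [[regs[1 + 4 * row + col] for col in range(4)] for row in range(4)]
    # R17 points at the input vector, R18 at the output vector.
    vec = [mem[regs[17] + i] for i in range(4)]
    for col in range(4):
        mem[regs[18] + col] = sum(vec[row] * t[row][col] for row in range(4))
    return [mem[regs[18] + col] for col in range(4)]


# Hypothetical test data: T holds 1..16, input at address 0, output at address 8.
regs = {r: r for r in range(1, 17)}
regs[17], regs[18] = 0, 8
mem = {0: 1, 1: 1, 2: 1, 3: 1}
print(vec_mat_4x4(regs, mem))  # -> [28, 32, 36, 40]
```

With an all-ones input vector each output is simply a column sum of T, which makes a wrong register mapping easy to spot.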

Files to be modified
• instmem.vhd
  ⇒ Must be edited to hold your program.
• regfile.vhd
  ⇒ Edit this so that the initial contents of R1-R18 contain values that can be used to test your code.
• datamem.vhd
  ⇒ Edit this to contain x, y, z, w values that you can use to test your code.

Theoretical performance?
• The number of operations in the matrix multiply is:
  ⇒ 4 Loads (read [1 x 4] input matrix)
  ⇒ 4 Stores (write [1 x 4] output matrix)
  ⇒ 16 Mults (matrix products)
  ⇒ 12 Adds (column sums)
• Total ops is 36. The theoretical number of required instruction packets is 36/EUMAX.
  ⇒ If EUMAX = 4, then it should take only 9 instruction packets!
• You will be graded on how close you get to this limit.
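The arithmetic behind this limit can be written out as a short sketch. The names below are hypothetical. Rounding up with `ceil` is an addition for values of EUMAX that do not divide 36, and the separate memory bound comes from the one-load-or-store-per-packet_vec rule stated earlier in the deck. Neither bound accounts for data dependencies, so the real minimum may be higher.

```python
import math

# Operation counts from the slide.
OPS = {"load": 4, "store": 4, "mult": 16, "add": 12}


def packet_bounds(ops, eumax):
    """Return (total ops, slot-limited bound, memory-limited bound)."""
    total = sum(ops.values())
    slot_bound = math.ceil(total / eumax)    # at most eumax ops per packet
    mem_bound = ops["load"] + ops["store"]   # at most one load/store per packet
    return total, slot_bound, mem_bound


print(packet_bounds(OPS, 4))  # -> (36, 9, 8)
```

For EUMAX = 4 the slot bound of 9 dominates; the memory bound of 8 does not change when EUMAX is raised.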

A Question to be answered
• What is the smallest number of packets possible for this problem if we increased EUMAX but kept the same load/store limitations?
  ⇒ What would EUMAX need to be in order to reach that minimum, given the load/store limitations and data dependencies?
• You must show work that validates your answer.

Scheduling Approach
• The problem is to schedule the 36 operations.
• Try to figure out the minimum number of packets needed given the load/store limitations and the data dependencies of the calculations.
  ⇒ Don’t worry about register assignment yet.
  ⇒ Initially, calculate each intermediate value in the packet immediately before the one where it is needed; move it up to an earlier packet later on.
• At this point, you will have satisfied the load/store limitations and the data dependencies.

Approach (cont.)
• You may have exceeded the limit of 4 operations per packet by computing values only in the packet before they are actually needed.
  ⇒ Count the number of free operation slots in packets before the excess operations. If the number of free slots is less than the number of excess operations, you will definitely have to add another packet. Otherwise, try moving the excess operations to earlier free slots. Even if enough free slots exist, you may still need to add extra packets because of load/store limitations.
• Once all of the operations are scheduled (only 4 operations per packet), assign registers.
  ⇒ If you run out of registers, you may have to add another instruction packet.
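The slot-counting check can be sketched as follows: an extra packet is unavoidable whenever the free slots in earlier packets are fewer than the excess operations that must move up. The function name and the example packet sizes are hypothetical. The check is only a necessary condition, because it ignores data dependencies and the one-memory-operation rule.

```python
def needs_extra_packet(packet_sizes, eumax=4):
    """True if excess ops cannot all fit into free slots of earlier packets.

    packet_sizes: number of operations tentatively placed in each packet,
                  in packet order.
    """
    free = 0  # free slots seen so far in earlier packets
    for size in packet_sizes:
        if size <= eumax:
            free += eumax - size
        else:
            excess = size - eumax
            if excess > free:
                return True   # nowhere earlier to put the excess
            free -= excess    # excess ops move up into earlier free slots
    return False


print(needs_extra_packet([1, 3, 5, 6, 1]))  # -> False
print(needs_extra_packet([4, 4, 6]))        # -> True
```

A `False` result only means that moving operations earlier is worth trying; the dependencies still have to allow each move.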

Scheduling Template

            slot 0   slot 1   slot 2   slot 3   Excess Ops
  packet 0
  packet 1
  packet 2
  ...
  packet i

It helps to use descriptive names for the operations: P0-P15 (16 products), LD0-LD3 (4 loads), S0-S11 (12 sums), ST0-ST3 (4 stores). Draw arrows to indicate dependencies.
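A filled-in template can be checked mechanically. The sketch below is one possible checker, not a tool supplied with the assignment. The numbering of products and sums (P(4*row+col), and a chain of three adds per output column) is an assumed convention, and it assumes a value produced in one packet is usable in the next packet through forwarding.

```python
def build_deps():
    """Map each operation name to the operations it depends on."""
    deps = {f"LD{i}": [] for i in range(4)}           # LD0..LD3 load x, y, z, w
    for row in range(4):
        for col in range(4):
            deps[f"P{4 * row + col}"] = [f"LD{row}"]  # product uses one input
    for col in range(4):                              # assumed chain of 3 adds
        deps[f"S{3 * col}"] = [f"P{col}", f"P{4 + col}"]
        deps[f"S{3 * col + 1}"] = [f"S{3 * col}", f"P{8 + col}"]
        deps[f"S{3 * col + 2}"] = [f"S{3 * col + 1}", f"P{12 + col}"]
        deps[f"ST{col}"] = [f"S{3 * col + 2}"]
    return deps


def check_schedule(packets, deps, eumax=4):
    """Return a list of rule violations; an empty list means legal."""
    errors, where = [], {}
    for i, packet in enumerate(packets):
        if len(packet) > eumax:
            errors.append(f"packet {i}: more than {eumax} operations")
        if sum(op.startswith(("LD", "ST")) for op in packet) > 1:
            errors.append(f"packet {i}: more than one memory operation")
        for op in packet:
            if op in where:
                errors.append(f"{op} is scheduled twice")
            where[op] = i
    for op, preds in deps.items():
        if op not in where:
            errors.append(f"{op} is not scheduled")
            continue
        for pred in preds:
            # Producer must sit in a strictly earlier packet.
            if pred in where and where[pred] >= where[op]:
                errors.append(f"{op} does not come after {pred}")
    return errors


deps = build_deps()
serial = [[op] for op in deps]       # one operation per packet: legal but slow
print(len(deps), check_schedule(serial, deps))  # -> 36 []
```

Each row of the template becomes one inner list, for example `["LD0"]` for packet 0; any operation listed in the "Excess Ops" column will show up as a violation until it is moved.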