Hiding Synchronization Delays in a GALS Processor Microarchitecture
Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott, Steven G. Dropsho, Sandhya Dwarkadas
ASYNC 2004 - University of Rochester

Why GALS?
- Simplified clock distribution network
- Reduced clock power dissipation
- Allows modular design of the processor
- Can run each domain at optimal frequency
- Can use conventional design and testing methods
- Fine-grained DVS/DFS

But there is a cost…
- Inter-domain synchronization can hurt performance
- Synchronization circuit costs in area and power
- We have to be careful how we divide the processor

The MCD Microprocessor
[Figure: block diagram of the MCD CPU and main memory. Labeled blocks:
 Frontend (L1 instruction cache, branch predict, fetch, IFQ, rename, dispatch, ROB);
 Integer (IIQ, integer register file, integer FUs);
 Floating Pt (FIQ, fp register file, fp FUs);
 Memory (LSQ, L1 data cache, unified L2 cache);
 Main Memory]

Inter-domain Synchronization
- Queue design based on Chelcea and Nowick (WVLSI ’00)
  - Modified for Issue Queue configuration
- Synchronization circuit based on Nyström and Martin (WCED ’02)
  - Converted to single-rail logic
- Timing analysis based on Sjogren and Myers (ARVLSI ’97)
  - Skip a cycle rather than pause the clock

Synchronization via Queues
[Figure: two panels, "FIFO Queue" and "Issue Queue"]

Timing Analysis
[Figure: timing diagram of CLK1 and CLK2 with clock edges numbered 1 to 4 and the interval T marked]
- Source runs with CLK1, destination with CLK2
- Source writes at edge 1
- If T > Ts then the data can be used at edge 2
- If T < Ts then the data can be used at edge 3
- 25% < Ts < 35%
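The edge-selection rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the circuit or the simulator used in the work: the function name `first_usable_edge`, its parameters, the choice of picoseconds as the time unit, and the treatment of T equal to Ts as "too close" are all assumptions made here. T is taken to be the interval from the source write (edge 1) to the next destination edge (edge 2).

```python
import math


def first_usable_edge(write_time, dst_period, dst_phase, ts):
    """Return the time of the destination edge at which the data can be used.

    Hypothetical helper: destination edges are assumed to fall at
    dst_phase + k * dst_period for integer k (an ideal, jitter-free clock).
    """
    # Index of the first destination edge strictly after the write ("edge 2").
    k = math.floor((write_time - dst_phase) / dst_period) + 1
    edge_2 = dst_phase + k * dst_period

    # T: time between the source write (edge 1) and edge 2.
    t = edge_2 - write_time

    if t > ts:
        # Enough time to synchronize: use the data at edge 2.
        return edge_2
    # Too close (T == Ts is treated as too close, an assumption):
    # skip one destination cycle and use the data at edge 3.
    return edge_2 + dst_period
```

For example, with a 1000 ps destination period and Ts = 300 ps, a write at time 0 is usable at 500 ps when the next destination edge is 500 ps away, but only at 1200 ps when that edge is just 200 ps away.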

Simulation Methodology
- Two processor pipelines
  - Alpha 21264
  - StrongARM SA-1110
- Synchronization penalty was measured against an identical synchronous design
- 30 benchmarks
  - MediaBench, Olden, SPEC 2000

Simulation Methodology
- SimpleScalar + Wattch + MCD
- Independent clock for each domain
  - Independent jitter for each domain
  - Next edge based on period, last edge, jitter
- When source and destination clocks are too close, a one-cycle penalty is assessed
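As a rough illustration of the clock model described on this slide, the sketch below generates edges for two independent clocks and counts how often a source edge lands too close to the next destination edge. It is not the MCD simulator's code. The names `clock_edges` and `count_penalties`, the Gaussian jitter distribution, and the "gap <= ts" closeness test are assumptions chosen for the example.

```python
import random


def clock_edges(period, jitter_sigma, n, rng, start=0.0):
    """Generate n edge times; each edge depends on the last edge, the period, and jitter.

    Jitter is drawn from a normal distribution here (an assumption). It should be
    small relative to the period so that edge times stay increasing.
    """
    edges = []
    last = start
    for _ in range(n):
        last = last + period + rng.gauss(0.0, jitter_sigma)
        edges.append(last)
    return edges


def count_penalties(src_edges, dst_edges, ts):
    """Count source edges whose next destination edge is within ts (too close)."""
    penalties = 0
    j = 0
    for s in src_edges:
        # Advance to the first destination edge strictly after this source edge.
        while j < len(dst_edges) and dst_edges[j] <= s:
            j += 1
        if j == len(dst_edges):
            break  # no later destination edge was generated
        if dst_edges[j] - s <= ts:
            penalties += 1  # one destination cycle would be skipped
    return penalties


if __name__ == "__main__":
    rng = random.Random(2004)  # fixed seed so the run is repeatable
    src = clock_edges(period=1000.0, jitter_sigma=10.0, n=10000, rng=rng)
    dst = clock_edges(period=1100.0, jitter_sigma=10.0, n=10000, rng=rng)
    print(count_penalties(src, dst, ts=300.0), "of", len(src), "source edges penalized")
```

Each domain gets its own sequence of jitter samples, so the phase relationship between the two clocks drifts over time and the fraction of penalized transfers depends on the two periods and on Ts.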

Synchronization Analysis
- OoO and superscalar capabilities removed from Alpha

Synchronization Analysis
- OoO and superscalar capabilities added to StrongARM

What we have learned
- Synchronization penalty doesn’t mean performance loss
  - Out-of-order execution allows useful work to be performed when instructions are delayed
  - Superscalar design means that synchronization penalties can be “shared” across multiple instructions
  - For Alpha, 95% of penalty hidden
  - For StrongARM++, 63% of penalty hidden
- We have to be careful
  - Cannot have too many domains
  - Careful where you split!

Conclusions
- GALS is a good idea for real processors
  - small IPC loss
  - clock network simplification
  - reduction in power dissipation
  - higher frequency
  - independent domain tuning