Timing Anomalies in Dynamically Scheduled Microprocessors
Thomas Lundqvist, Per Stenström (RTSS '99)
Presented by: Kaustubh S. Patil

Assumptions in current methods to find Worst Case Execution Time (WCET)
• Execution time of an instruction is not fixed
  - Due to pipeline stalls or cache misses
  - Input data dependency, e.g. mulhw, mulhwu, mullw in the PowerPC architecture
• In such cases, current methods assume the longest instruction latency for every instruction
  - e.g. if the outcome of a cache access is unknown, a cache miss is assumed.
  - Intuition-based

Claim: Making such assumptions for dynamically scheduled processors is wrong!
• Dynamically scheduled processors
  - Out-of-program-order instruction execution
• For such processors, a counter-intuitive increase or decrease in execution time is possible
  - e.g. a cache miss can actually reduce the overall execution time.
  - These are termed timing anomalies.

Organization of the presentation
• Description of architectural features that may cause anomalies
• Examples of timing anomalies
• Handling of such anomalies in previous methods
• Proposed methods to eliminate such anomalies
• Case study of a previous method in the context of proposed solutions

Terms and definitions
• Formal definition of timing anomaly
  - Instruction latency is the same as instruction execution time
  - Case 1: the latency of the first instruction is increased by i cycles
  - Case 2: it is decreased by d cycles
  - Let C be the resulting future change in execution time
  Definition: a timing anomaly is a situation where, in the first case, C > i or C < 0, or, in the second case, C < -d or C > 0.
• In-order and out-of-order resources
• If a processor only contains in-order resources, no timing anomalies can occur
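The definition above can be written as a small predicate. This is only a sketch: the function name and the convention of passing the latency change as a signed number (positive for an increase of i cycles, negative for a decrease of d cycles) are assumptions made for illustration, not part of the original slides.

```python
def is_timing_anomaly(latency_change: int, future_change: int) -> bool:
    """Apply the slide's definition of a timing anomaly.

    latency_change: signed change in the first instruction's latency
        (hypothetical convention: +i for case 1, -d for case 2).
    future_change: C, the resulting change in total execution time.
    """
    if latency_change > 0:
        i = latency_change
        # Case 1: anomaly if the slowdown exceeds i, or the program got faster.
        return future_change > i or future_change < 0
    if latency_change < 0:
        d = -latency_change
        # Case 2: anomaly if the speedup exceeds d, or the program got slower.
        return future_change < -d or future_change > 0
    raise ValueError("latency_change must be non-zero")


# An 8-cycle latency increase that delays completion by 11 cycles is anomalous.
print(is_timing_anomaly(8, 11))
# The same increase delaying completion by at most 8 cycles is not.
print(is_timing_anomaly(8, 8))
```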

Architecture used for illustrating

Timing anomaly examples
• A cache hit results in the WCET
• B is dependent on A
• In the cache-hit case, B gets priority over C
• In the cache-miss case, D & E execute 1 cycle earlier
• The reason for this anomaly
  - IU is an out-of-order resource

Timing anomaly examples (…contd)
• Overall miss penalty can be higher than a single cache miss penalty
• A, B, C have dependencies
• C always results in a miss
• C finishes 11 cycles later instead of one miss penalty of 8 cycles
• MCIU allows B and D to execute out-of-order

Timing anomaly examples (…contd)
• Unbounded impact on WCET
• A and B make up a loop body
• Fast case
  - 'A' executes as soon as it is dispatched
• Slow case
  - 'A' is delayed by one cycle
  - Old B gets priority over new A
  - 'A' gets delayed in each iteration
  - Total penalty is k cycles for k iterations

Limitations of previous methods
• Such methods make locally safe decisions, at basic block or instruction level.
• Timing anomalies due to variable latency instructions and different pipeline states do not allow this.
• Consider an instruction sequence with n variable latency instructions.
• Each such instruction can have k different latencies.
• Need to examine k^n possibly different schedules
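To make the k^n growth concrete, the sketch below enumerates every combination of latencies for a short sequence. The latency values (1 and 9 cycles) and the function name are hypothetical and chosen only for illustration; each combination would have to be scheduled separately to find the true worst case.

```python
from itertools import product


def enumerate_latency_assignments(latencies_per_instruction):
    """Return every combination of latencies, one choice per instruction."""
    return list(product(*latencies_per_instruction))


# Hypothetical example: n = 3 variable-latency instructions,
# each with k = 2 possible latencies (say, cache hit = 1, miss = 9).
choices = [(1, 9)] * 3
assignments = enumerate_latency_assignments(choices)

# k ** n = 2 ** 3 combinations to examine.
print(len(assignments))
```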

Methods for eliminating anomalies
• The pessimistic serial-execution method
  - All instructions are executed in-order.
  - All memory references are considered misses.
  - Which instruction sequence is considered?
  - Very pessimistic approach
• The program modification method
  - All unknown events and variable latency instructions must result in a predictable pipeline state
  - If a path is selected as the WCET path among a set of paths, then the end cache & pipeline state must be the same.

The program modification method (…contd)
Making pipeline state predictable
• Forced in-order resource use is one solution
  - Little processor support
• Use of the sync instruction in the PowerPC architecture
  - To take care of variable latency instructions
  - Also when cache hits are unpredictable
• sync works for both of the previous conditions

The program modification method (…contd)
Making cache state predictable
• After each path, invalidate all cache blocks
  - Poor performance
• Invalidate only differing cache blocks
  - Poor performance again
• Preload cache blocks
  - Special instruction support, e.g. icbt, dcbt in PowerPC

Case study: symbolic execution method
• Instruction-level simulation
• Extended instruction semantics to take care of 'unknown' operands
  e.g. add A, B, C
    A ← B + C, if both B and C are known
    A ← unknown, if either B or C is unknown
• Elimination of infeasible paths
• Merging of paths to avoid an exponential number of paths
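The extended add semantics can be sketched as follows. The `UNKNOWN` sentinel and the function name `sym_add` are hypothetical names introduced for this illustration; the slides do not say how the simulator represents unknown values.

```python
# Hypothetical sentinel standing for an operand whose value depends on input data.
UNKNOWN = object()


def sym_add(b, c):
    """Symbolic add: the result is known only if both operands are known."""
    if b is UNKNOWN or c is UNKNOWN:
        return UNKNOWN
    return b + c


print(sym_add(2, 3))                   # both operands known
print(sym_add(UNKNOWN, 3) is UNKNOWN)  # unknown propagates to the result
```

A branch whose condition evaluates to the unknown value cannot be resolved, which is why both successors must be followed and later merged.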

Changes to this existing method
• First pass identifies all places where local decisions need to be made
  - e.g. merging of paths and variable latency instructions
• Addition of sync and preload instructions at such sites
• Tserial = sum of all latencies and misses
• T = Tserial / 2 in the ideal case

Benchmarks used
• PSIM, an existing instruction-level simulator, was extended for symbolic execution and for the program modification approach
• The benchmarks used were:
  - matmult : Multiplies two 50×50 matrices
  - bsort : Bubble sort of 100 integers
  - isort : Insertion sort of 10 integers
  - fib : Calculates the nth element of the Fibonacci sequence for n < 30
  - DES : Encrypts 64-bit data
  - jfdctint : Discrete cosine transform of an 8×8 pixel image
  - compress : Compresses 50 bytes of data

Evaluation results

Program   Actual WCET  Unsafe WCET  Ratio  Serial WCET  Ratio  Modified  Ratio  Slowdown
matmult       5283287               1         10566574  2       6323287  1.20
bsort          230490               1           460981  2        256854  1.11
isort            2085               1             4170  2          2325  1.12
fib               797               1             1594  2           797  1      1
DES            186166       186358  1.001       372716  2.002    186358  1.001  1
jfdctint         9409               1            18819  2          9921  1.05
compress        16846        54583  3.31        109167  6.62      69291  4.20   1.27
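The slides do not define the ratio columns. One reading, stated here as an assumption, is that each "Ratio" divides the preceding WCET estimate by the actual WCET, and that the slowdown divides the modified estimate by the unsafe one. The sketch below applies that reading to the DES row; the function and key names are hypothetical.

```python
def wcet_ratios(actual, unsafe, serial, modified):
    """Compute the ratio columns under the assumed definitions.

    Assumption: ratios are relative to the actual WCET, and
    slowdown = modified / unsafe. The slides do not state this.
    """
    return {
        "unsafe_ratio": round(unsafe / actual, 3),
        "serial_ratio": round(serial / actual, 3),
        "modified_ratio": round(modified / actual, 3),
        "slowdown": round(modified / unsafe, 3),
    }


# DES row: actual, unsafe, serial, and modified WCET in cycles.
des = wcet_ratios(186166, 186358, 372716, 186358)
print(des)
```

Under this reading the DES row gives 1.001, 2.002, 1.001 and a slowdown of 1, which agrees with the listed values.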

Summary
• Timing anomalies in dynamically scheduled processors may cause wrong WCET estimates with previous methods.
• Using architecture support to control the state of the cache and pipeline, it is possible to eliminate anomalies, and the previous methods can then be used on such modified programs.

Thank you! Questions?