1 Faster unicores are still needed André Seznec INRIA/IRISA

2 DAL: Defying Amdahl’s Law
• ERC advanced grant to A. Seznec (2011-2016)
• DAL objective: « Given that Amdahl’s Law is forever, propose (impact) the microarchitecture of the 2020 general-purpose manycore »

3 Multicores are everywhere
• Multicores in servers, desktops, laptops
§ 2-4-8-12 out-of-order cores
• Multicores in smartphones, tablets
§ 2-4 (not that simple) cores
• Manycores for niche markets
§ 48-80-100 simple cores
§ Tilera, Intel Phi

4 Multicore/multithread for everyone
• End user: improved usage comfort
§ Can surf the web and listen to MP3s at the same time
• Parallel performance for the masses?
§ Very few (scalable) mainstream parallel apps
§ Graphics
§ Niche market segments

5 No parallel software bonanza in the near future
• Inheritance of sequential legacy code
• Parallelism is not cost-effective for most apps
• Sequential programming will remain dominant

6 Inheritance of sequential legacy code
• Software is more resilient than hardware
§ Apps survive and evolve for years, often decades
§ Very few parallel apps now
• Redevelopment of parallel apps from scratch is unlikely
• Compute-intensive sections will be parallelized
§ But significant code sections will remain sequential

7 Parallelism is not cost-effective for most apps
• Why parallelism?
§ Only for performance
• But costly:
§ Difficult, time-consuming, error-prone
§ Poorly portable: functionality and performance

8 Sequential programming will remain dominant
§ Just easier
§ The « Joe » programmer
§ Portability, maintenance, debugging
+ compilers to parallelize
+ parallel libraries
+ software components (developed by experts)

9 Looking backwards

10 2002: The End of the Uniprocessor Road
• Power and temperature walls:
§ Stopped the frequency increase
• 2x transistors: 5%? 10%? more performance (if any)
§ Economic logic: buy smaller chips!
• IC industry needs to sell new (expensive) chips:
§ Marketing: « You need hyperthreading, 2, 4, 8 cores »

11 Marketing multicores to the masses
• 2002-...: SMT, dual-core SMT, quad-core SMT
• « GREAT !! »

12 And now?
• The end user is not such a fool...

13 Following the trend: 2020
• Silicon area, power envelope
§ ≈ 100 Nehalem-class cores, or
§ ≈ 1,000 simple cores (VLIW, in-order superscalar)

14 Amdahl’s Law
• “Cannot run faster than sequential part”
[Figure: execution time split into a sequential part and a parallel part]

15 OK, parallel applications do not scale
• Our recent study on parallel application scaling:
[Equation: execution time as a function of input set and processor number]
• In general: bp > -1: sublinear scaling
• Sometimes: bs > 0: sequential part increases

16 But let us use a naive (overoptimistic) model
• A parallel application:
§ Parallel section: can use 1000 processors, with linear speed-up
§ Sequential section: runs on a single processor
• SEQ: constant fraction of sequential code

17 Complex cores against simple cores
• CC: 100 complex cores vs SC: 1000 simple cores
• A complex core is 2x faster than a simple core
• If SEQ > 0.8% then CC > SC

18 And hybrid SC + CC?
• CC_SC:
§ 50 complex cores
§ 500 simple cores
• If SEQ > 0.2% then CC_SC > SC

19 And if...
• Use a huge amount of resources for a single core:
§ 10x the area of the complex core
§ 10x the power of the complex core
• Use all the uniprocessor techniques
§ Very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction
• Invent new techniques
• Ultra complex cores

20 DAL architecture proposition
• Heterogeneous architecture:
§ A few ultra complex cores: to enable performance on sequential code and/or critical sections
§ A « sea » of simple cores: for parallel sections

21 For the naive model
• « DAL »: UC_SC, 5 ultra complex cores + 500 simple cores
• If SEQ > 0.13% then « DAL » > SC
• « DAL » always better than UC, CC_SC
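The thresholds quoted on slides 17, 18 and 21 can be checked with a few lines of arithmetic. The sketch below is not from the talk; it rests on two assumptions that happen to reproduce the quoted numbers: in the hybrid configurations the parallel section runs on the 500 simple cores only, and an ultra complex core is 4x faster than a simple core (the slides do not state either). The names `exec_time` and `breakeven_seq` are illustrative.

```python
def exec_time(seq, seq_speed, par_throughput):
    """Normalized run time: sequential part on one core, parallel part at full throughput.

    Speeds are relative to one simple core (speed 1.0).
    """
    return seq / seq_speed + (1.0 - seq) / par_throughput


def breakeven_seq(seq_speed, par_throughput, sc_throughput=1000.0):
    """Sequential fraction above which a configuration beats SC (1000 simple cores).

    Solves seq/seq_speed + (1-seq)/par_throughput = seq + (1-seq)/sc_throughput.
    """
    gap = 1.0 / par_throughput - 1.0 / sc_throughput
    return gap / (1.0 - 1.0 / seq_speed + gap)


# (speed of the core running the sequential part, parallel throughput)
configs = {
    "CC": (2.0, 100 * 2.0),   # 100 complex cores, each 2x a simple core
    "CC_SC": (2.0, 500.0),    # assumption: parallel part on the 500 simple cores only
    "DAL": (4.0, 500.0),      # assumption: ultra complex core is 4x a simple core
}

thresholds = {name: breakeven_seq(*cfg) for name, cfg in configs.items()}

for name, seq in thresholds.items():
    print(f"{name}: beats SC when SEQ > {100 * seq:.2f}%")
```

Under these assumptions the break-even points come out at about 0.79%, 0.20% and 0.13%, in line with the rounded figures on the slides.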

22 Need for research on faster unicores
• Silicon area is a second-order issue
§ can use the area of 10 complex cores
• Power/energy is a second-order issue
§ can use the power of 10 complex cores

23 Ongoing work: Revisiting Value Prediction
with Arthur Pérais

24 Value prediction?
• Lipasti et al., Gabbay and Mendelson, 1996
• Basic idea:
§ Eliminate (some) true data dependencies through predicting instruction results
[Figure: instructions I0, I1, I3, I4, I5 with labels +2, +3, +1, +3]

25 Value Prediction:
• Large body of research, 1996-2002
• Quite efficient:
§ Surprisingly high number of predictable instructions
• Not implemented so far:
§ High cost: is it still relevant now?
§ High penalty on misprediction: don’t lose all the benefit

26 Last Value Predictor
• Just predict the last produced value
§ Set-associative table
§ Use confidence counters
• Analogy with PC-based branch prediction
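As an illustration of the idea, here is a minimal software sketch of a last value predictor. It is a simplification, not the design from the talk: the table is direct-mapped rather than set-associative, and the table size and the `predict`/`update` interface are assumptions. The confidence counter follows the 3-bit saturate-and-reset scheme mentioned on the experiments slide.

```python
class LastValuePredictor:
    """Direct-mapped last value predictor with saturating confidence counters."""

    def __init__(self, entries=1024, conf_max=7):
        self.entries = entries      # hypothetical table size
        self.conf_max = conf_max    # a 3-bit counter saturates at 7
        self.table = {}             # index -> (tag, last_value, confidence)

    def _index_tag(self, pc):
        return pc % self.entries, pc // self.entries

    def predict(self, pc):
        """Return the predicted value, or None when confidence is not saturated."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is not None and entry[0] == tag and entry[2] == self.conf_max:
            return entry[1]
        return None

    def update(self, pc, actual):
        """Train with the committed result of the instruction at pc."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is not None and entry[0] == tag and entry[1] == actual:
            conf = min(entry[2] + 1, self.conf_max)
        else:
            conf = 0                # reset on a different value or a new entry
        self.table[idx] = (tag, actual, conf)
```

The predictor only speaks up once the same value has been seen enough times in a row to saturate the counter; a single different value silences it again.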

27 Stride value predictor
• Prediction = last value + (last difference)
[Figure: PC-indexed table feeding an adder]
• Analogy with stride prefetcher, but also with loop predictor
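A stride predictor can be sketched in the same style. This is an illustrative model under stated assumptions: the table is an unbounded dictionary keyed by PC, and confidence is raised only when the same stride repeats; neither detail comes from the slides.

```python
class StridePredictor:
    """Predicts last value + last stride once the stride has proven stable."""

    def __init__(self, conf_max=7):
        self.conf_max = conf_max
        self.table = {}             # pc -> (last_value, stride, confidence)

    def predict(self, pc):
        """Return last value + stride, or None when confidence is not saturated."""
        entry = self.table.get(pc)
        if entry is not None and entry[2] == self.conf_max:
            return entry[0] + entry[1]
        return None

    def update(self, pc, actual):
        """Train with the committed result of the instruction at pc."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (actual, 0, 0)
            return
        last, stride, conf = entry
        new_stride = actual - last
        if new_stride == stride:
            conf = min(conf + 1, self.conf_max)
        else:
            conf = 0                # stride changed: start over
        self.table[pc] = (actual, new_stride, conf)
```

For example, an instruction producing 10, 13, 16, ... becomes predictable after the stride of 3 has repeated often enough to saturate the counter.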

28 Finite Context Method predictors
• Use the history of the last values produced by the instruction
[Figure: PC-indexed value history table feeding a prediction table]
• Analogy with local history branch predictor
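The two-level structure of a finite context method (FCM) predictor can be sketched as follows. This is a simplified model: a hardware design would hash the value history into a finite second-level table, whereas this sketch keys a dictionary on the exact (pc, history) pair. The `order` parameter and the interface are assumptions.

```python
class FCMPredictor:
    """Order-n finite context method predictor (software sketch)."""

    def __init__(self, order=4, conf_max=7):
        self.order = order
        self.conf_max = conf_max
        self.history = {}   # first level: pc -> tuple of the last `order` values
        self.values = {}    # second level: (pc, history) -> (value, confidence)

    def predict(self, pc):
        """Return the value that followed the current history, if confident."""
        ctx = self.history.get(pc, ())
        if len(ctx) < self.order:
            return None
        entry = self.values.get((pc, ctx))
        if entry is not None and entry[1] == self.conf_max:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train the second level, then shift the value into the local history."""
        ctx = self.history.get(pc, ())
        if len(ctx) == self.order:
            entry = self.values.get((pc, ctx))
            if entry is not None and entry[0] == actual:
                conf = min(entry[1] + 1, self.conf_max)
            else:
                conf = 0
            self.values[(pc, ctx)] = (actual, conf)
        self.history[pc] = (ctx + (actual,))[-self.order:]
```

Unlike the stride predictor, this scheme can learn repeating patterns such as 1, 2, 3, 1, 2, 3, but the prediction depends on the most recent values of the same instruction.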

29 And global value history
• Just no sense!
§ Need the history of the last instructions
§ Too late!!
• But global branch history!?!
§ ITTAGE is the state-of-the-art indirect branch predictor!!
§ And it predicts values!

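30 ITTAGE / VTAGE
[Figure: a tagless base predictor indexed by pc, plus tagged components indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; each tagged component performs a tag match (=?) and supplies a prediction]
• Longest matching component provides the prediction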
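The lookup rule on this slide (longest matching tagged component wins, with a tagless base predictor as fallback) can be sketched as below. This is a heavily simplified model and not the VTAGE design itself: it omits confidence and usefulness counters, and the history lengths, hash functions, table size and allocation policy are all hypothetical choices made for illustration.

```python
class VTageSketch:
    """Tagless base table plus tagged tables indexed with global branch history."""

    def __init__(self, history_lengths=(2, 4, 8), entries=256):
        self.lengths = history_lengths      # hypothetical history lengths
        self.entries = entries
        self.base = {}                      # tagless: index -> value
        self.tagged = [{} for _ in history_lengths]  # index -> (tag, value)
        self.ghist = 0                      # newest branch outcome in bit 0

    def record_branch(self, taken):
        """Shift a branch outcome into the global history."""
        mask = (1 << max(self.lengths)) - 1
        self.ghist = ((self.ghist << 1) | int(taken)) & mask

    def _slot(self, pc, length):
        # Illustrative hashes of pc and the `length` most recent outcomes.
        h = self.ghist & ((1 << length) - 1)
        return (pc ^ h) % self.entries, (pc + 7 * h) & 0xFFFF

    def _lookup(self, pc):
        # Search from the longest history down to the shortest.
        for comp in reversed(range(len(self.lengths))):
            idx, tag = self._slot(pc, self.lengths[comp])
            entry = self.tagged[comp].get(idx)
            if entry is not None and entry[0] == tag:
                return entry[1], comp
        return self.base.get(pc % self.entries), -1

    def predict(self, pc):
        return self._lookup(pc)[0]

    def update(self, pc, actual):
        predicted, provider = self._lookup(pc)
        if predicted != actual:
            if provider >= 0:
                # Correct the value held by the providing component.
                idx, tag = self._slot(pc, self.lengths[provider])
                self.tagged[provider][idx] = (tag, actual)
            if provider + 1 < len(self.lengths):
                # Allocate an entry in the next longer-history component.
                idx, tag = self._slot(pc, self.lengths[provider + 1])
                self.tagged[provider + 1][idx] = (tag, actual)
        self.base[pc % self.entries] = actual
```

The point of the structure is that the prediction is selected by control-flow history rather than by the instruction's own recent values, so a value that depends on which path was taken can be predicted without a local value history.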

31 The repair issue on misprediction
[Figure: instructions I0, I1, I3, I4, I5, with a value misprediction]

32 Pipeline squash
[Figure: instructions I0, I1, I3, I4, I5]
• Handled like an exception or a branch misprediction
• Very high penalty

33 Selective replay
[Figure: instructions I0, I1, I3, I4, I5]
• Cancel all dependent instructions, but save the others
• Very complex to implement:
§ Unlimited dependence chains

34 Critical path
• Predicted value needed late in the pipeline:
§ Dispatch time is sufficient
• Except that:

35 An FCM implementation issue
[Figure: PC-indexed predictor tables with a speculative window]
• Must take the last local values
• Might be a critical path

36 Critical path on the stride value predictor
[Figure: PC-indexed table, adder, and speculative window]
• Can be reused on the next cycle
• Stride AND speculative last value must be high confidence

37 Experiments
• 8-way superscalar, deep pipeline
• Use prediction only on high confidence
§ 3-bit counters + saturated + reset

38 Squashing
[Figure: bar chart per benchmark (164.gzip, 168.wupwise, 173.applu, 175.vpr, 179.art, 186.crafty, 197.parser, 255.vortex, 401.bzip2, 403.gcc, 416.gamess, 429.mcf, 433.milc, 444.namd, 445.gobmk, 456.hmmer, 458.sjeng, 464.h264, 470.lbm); series: LVP, Stride, FCM, VTAGE; y-axis from 0.8 to 1.4]

39 Selective replay
[Figure: bar chart per benchmark, 164.gzip through 470.lbm; series: LVP, Stride, FCM, VTAGE; y-axis from 0.9 to 1.4]

40 High confidence through probabilistic counters
• Need for very high confidence:
§ 95% accuracy insufficient
§ >> 99% needed
• TRADING ACCURACY AGAINST COVERAGE
• Saturation with only very low probability
§ 1/32, 1/256
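One way to read the probabilistic counter idea is that a small counter is incremented on a correct prediction only with a low probability, so that saturation requires a long run of correct predictions on average. The sketch below assumes a single probability applied to every increment; the slide does not say whether the probability applies to every step or only to the final one, and the `draw` parameter is an illustrative hook for injecting a random source.

```python
import random


class ProbabilisticCounter:
    """Small confidence counter whose increments succeed only with probability `prob`."""

    def __init__(self, conf_max=7, prob=1 / 32, draw=random.random):
        self.conf_max = conf_max
        self.prob = prob        # assumption: same probability for every increment
        self.draw = draw        # callable returning a float in [0, 1)
        self.value = 0

    def on_correct(self):
        """A correct prediction raises confidence only occasionally."""
        if self.value < self.conf_max and self.draw() < self.prob:
            self.value += 1

    def on_wrong(self):
        """Any misprediction resets confidence."""
        self.value = 0

    def is_confident(self):
        return self.value == self.conf_max
```

With prob = 1/32, a 3-bit counter needs on the order of a couple of hundred consecutive correct predictions before it saturates, which is how a few bits can stand in for a much wider counter at the price of reduced coverage.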

41 Squashing
[Figure: bar chart per benchmark, 164.gzip through 470.lbm; series: LVP, Stride, FCM, VTAGE; y-axis from 0.9 to 1.4]

42 And hybrids
[Figure: bar chart per benchmark, 164.gzip through 470.lbm; series: Stride, VTAGE-Stride, 3c-Hybrid; y-axis from 0.9 to 1.5]

43 Current status
• All value predictors amenable to very high confidence
§ No complex selective repair needed
• No need for local value prediction
§ No complex critical path in the local value predictor

44 Ongoing work: Selective Prediction of Predicated Instructions
with Nathanael Prémillieu

45 Who cares about predicated instructions?
• CMOV in all ISAs
• ARM, Itanium:
§ All instructions are predicated
• Out-of-order execution: just a nightmare

46 The multiple definition problem
Before renaming:
I1: R1 ← R2, R3 (p)
I2: R4 ← R1, R2
Mapping table: R1 → P11, R2 → P15, R3 → P22
After renaming:
I1: P1 ← P15, P22 (p)
I2: P13 ← ???, P15
Mapping table: R1 → P1 || P11, R2 → P15, R3 → P22, R4 → P13

47 Expansion/Serialization
After renaming:
I1a: P1 ← P15, P22
I1b: P27 ← (p) ? P1, P11
I2: P13 ← P27, P15
Mapping table: R1 → P27, R2 → P15, R3 → P22, R4 → P13
• Create an extra instruction
• Force I1b → I2 dependency
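The expansion scheme can be sketched as a toy renamer that splits each predicated instruction into the operation itself plus a select micro-op merging the new result with the previous mapping of the destination. The tuple-based instruction encoding and the function name are assumptions made for illustration; the register names follow the slide's example.

```python
def rename_with_expansion(instrs, mapping, free_regs):
    """Rename (dest, sources, predicated) instructions, expanding predicated ones.

    mapping: architectural -> physical register map, updated in place.
    free_regs: iterator yielding free physical register names.
    Returns a list of (kind, dest, sources) renamed micro-ops.
    """
    renamed = []
    for dest, sources, predicated in instrs:
        phys_sources = [mapping[s] for s in sources]
        result = next(free_regs)
        renamed.append(("op", result, phys_sources))
        if predicated:
            # Extra micro-op: pick the new result or the old value of dest,
            # depending on the predicate, so dest has a single definition.
            merged = next(free_regs)
            renamed.append(("select", merged, [result, mapping[dest]]))
            result = merged
        mapping[dest] = result
    return renamed


mapping = {"R1": "P11", "R2": "P15", "R3": "P22"}
free_regs = iter(["P1", "P27", "P13"])
program = [
    ("R1", ["R2", "R3"], True),    # I1: R1 <- R2, R3 (p)
    ("R4", ["R1", "R2"], False),   # I2: R4 <- R1, R2
]
renamed = rename_with_expansion(program, mapping, free_regs)
for micro_op in renamed:
    print(micro_op)
```

On the slide's example this yields the op writing P1, the select writing P27 from P1 and P11, and I2 reading P27, so I2 always depends on the select whatever the predicate turns out to be.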

48 Aggressive serialization
I1: P18 ← (p) ? (op P15, P22) : P23
I2: P13 ← P18, P15
Mapping table: R1 → P18, R2 → P15, R3 → P22, R4 → P13
• No expansion, but an extra operand on I1:
§ complexity on register file, issue logic, bypass network
• Force I1 → I2 dependency

49 Predicting the predicates
• Use branch history or branch+predicate history to predict the predicates
§ Eliminate multiple definitions
§ Predicate mispredictions become branch mispredictions

50 Not that convincing!
[Figure: bar chart per benchmark and input, 400.perlbench.checkspam through 483.xalancbmk.ref; series: Branch, Br & Pred; y-axis from -20 to 10]

51 • Filter the predicate prediction • Replay at rename time the mispredicted predicates

52 [Figure: bar chart per benchmark and input, 400.perlbench.checkspam through 483.xalancbmk.ref; series: NonAggressive, NA SPREPI, Ag. SPREPI; y-axis from -4 to 10]

53
• Predicate prediction + filtering allows:
§ Better performance
§ Without aggressive out-of-order implementation
• Current compilers are « shy » on predication usage
§ might be worth reconsidering

54 Conclusion
• Faster cores are needed:
§ Amdahl’s law, uniprocessor workloads
• Silicon, power, etc. are available:
§ Just grab the resources from the rest of the system
• Do research as if (area, power) were not a constraint:
§ Then, take the constraints into account (or somebody else will manage to do it)