Enhancing Signature Path Prefetching with Perceptron Prefetch Filtering


Eshan Bhatia (1), Gino Chacon (1), Elvira Teran (2), Paul V. Gratz (1), Daniel A. Jiménez (3)
(1) Texas A&M University
(2) Texas A&M International University
(3) Texas A&M University / Barcelona Supercomputing Center

Introduction
Design Space: Standalone L1D, L2C and LLC Prefetchers
• Distribution of hardware budget across three prefetchers
• Interaction among the prefetchers
• Control over placing the incoming prefetch line (L1D vs L2C vs LLC)

Key Ideas
• Aggressive L2C Prefetching
– Signature Path Prefetcher (SPP) [Kim, MICRO '16]
– Perceptron-based Prefetch Filtering (PPF) [Bhatia, ISCA '19]
• Optimizing Prefetch Queue Sharing
– Page based resource sharing
• Minimal LLC Prefetching
– Lack of information
– LLC is a shared resource among cores
• Coordination between levels
– Minimizing impact of noisy prefetches on lower level prefetchers

Page Based Resource Sharing
• Prefetch Queue (PQ) limited in number
– Valuable resource for L1D / L2C
• Aggressive (but still accurate) prefetching
– Takes the current page deep into the speculation path
– Blocks PQ resources for other pages
– Timing disparity between multiple pages with interleaved accesses
• Efficient Resource Utilization
– Track number of distinct pages in last few memory accesses
– Divide PQ resource over those pages
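The resource-sharing idea above can be sketched as follows. This is only an illustration under stated assumptions: the window length, the PQ size, the 4 KB page size, the even split, and the names `PageTracker` and `pq_share_per_page` are all hypothetical and not taken from the slides.

```python
from collections import deque


def pq_share_per_page(recent_pages, pq_size):
    """Split the PQ evenly over the distinct pages seen recently (assumed policy)."""
    distinct = len(set(recent_pages))
    if distinct == 0:
        return pq_size
    # Every active page gets at least one entry.
    return max(1, pq_size // distinct)


class PageTracker:
    """Remembers the pages touched by the last `window` memory accesses."""

    def __init__(self, window=8, pq_size=16, page_bits=12):
        self.recent = deque(maxlen=window)  # oldest entries fall out automatically
        self.pq_size = pq_size
        self.page_bits = page_bits          # assumption: 4 KB pages

    def access(self, addr):
        """Record an access and return the PQ budget for each active page."""
        self.recent.append(addr >> self.page_bits)
        return pq_share_per_page(self.recent, self.pq_size)
```

With a single active page the whole queue is available to it; as more pages interleave, each page's budget shrinks so that one deeply speculating page cannot block the others.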

L1D Prefetcher: Next-N-Lines
• Fetches N consecutive lines with respect to the current demand address
– N determined through PQ resource availability
• Page level throttling
– Tracks per page access pattern for the last two accesses
– Scores page as +1 delta friendly or averse
– Throttles prefetching for averse pages
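A minimal sketch of next-N-line prefetching with page-level throttling, assuming a small saturating per-page score, 64 lines per page, no prefetching across the page boundary, and full throttling (no prefetches) for averse pages. The class name `NextNLine`, the score range, and these policies are assumptions for illustration; the slides do not give the exact scoring rule.

```python
LINES_PER_PAGE = 64  # assumption: 4 KB pages with 64 B cache lines


class NextNLine:
    def __init__(self, score_min=-2, score_max=2):
        self.last_line = {}  # page -> most recent line address in that page
        self.score = {}      # page -> saturating "+1 delta friendliness" score
        self.score_min = score_min
        self.score_max = score_max

    def on_access(self, line_addr, n):
        """Return the line addresses to prefetch for this demand access."""
        page = line_addr // LINES_PER_PAGE
        prev = self.last_line.get(page)
        if prev is not None:
            # Compare the last two accesses to the page: reward a +1 delta.
            step = 1 if line_addr - prev == 1 else -1
            s = self.score.get(page, 0) + step
            self.score[page] = max(self.score_min, min(self.score_max, s))
        self.last_line[page] = line_addr

        if self.score.get(page, 0) < 0:
            return []  # page is +1-delta averse: throttle
        last_in_page = (page + 1) * LINES_PER_PAGE - 1
        return [a for a in range(line_addr + 1, line_addr + n + 1)
                if a <= last_in_page]
```

Here `n` would come from the PQ budget currently available to the page.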

L2C Underlying Prefetcher: SPP
• Lookahead Prefetcher
– Uses previous prefetch suggestion to trigger new speculation
– Recursively iterates and keeps compounding the confidence
– Stops when the confidence falls below a certain threshold
• Threshold (hyperparameter) is an indication of aggressiveness
– Lower threshold -> more aggressive -> more coverage -> less accuracy
– Pre-defined trade-off between coverage and accuracy
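The lookahead loop can be sketched as below. This is a simplified illustration, not SPP's actual hardware tables: the pattern table is modeled as a plain dictionary from signature to `(delta, probability)` candidates, only the most likely delta is followed at each step, and the signature update (12-bit signature, shift by 3, XOR with the delta) and `max_depth` are assumptions made for the sketch.

```python
SIG_MASK = (1 << 12) - 1  # assumption: 12-bit signature


def next_signature(sig, delta):
    """Fold the newest delta into the signature (assumed shift-and-XOR update)."""
    return ((sig << 3) ^ (delta & 0x7F)) & SIG_MASK


def lookahead(pattern_table, sig, line, threshold, max_depth=16):
    """Follow the most likely delta repeatedly, compounding path confidence.

    Returns a list of (line_address, path_confidence) prefetch suggestions.
    """
    suggestions = []
    confidence = 1.0
    for _ in range(max_depth):
        candidates = pattern_table.get(sig)
        if not candidates:
            break  # no history for this signature
        delta, prob = max(candidates, key=lambda c: c[1])
        confidence *= prob  # confidence compounds along the path
        if confidence < threshold:
            break  # speculation is no longer trusted
        line += delta
        suggestions.append((line, confidence))
        # The suggestion itself triggers the next round of speculation.
        sig = next_signature(sig, delta)
    return suggestions
```

For a +1 stream in which every step has probability 0.9, a threshold of 0.7 stops after three suggestions (0.9, 0.81, 0.729), while a threshold of 0.5 continues one step deeper, which illustrates how a lower threshold trades accuracy for coverage.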

Enhanced SPP
• Decoupled coverage and accuracy concerns
– SPP enhanced to its most aggressive extreme
– Helps capture complex memory access patterns
– Increases coverage
– Perceptron Filtering (PPF) takes care of accuracy

Hashed Perceptron Model
• Use feature values to index into distinct tables
– Example: PC, memory address, etc.
• Prediction: Lookup, summation, threshold
– Use xi value to index into table of corresponding Wi
• Learning occurs when ground truth known
– Positive Outcome: Increment each feature's partial prediction weight
– Negative Outcome: Decrement each feature's partial prediction weight
• No multiplication, no division, no complex back-propagation
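The lookup, summation, threshold, and update steps above can be sketched as follows. The table sizes, the modulo indexing, the threshold of 0, and the class name `HashedPerceptron` are assumptions for illustration; the weight range corresponds to signed 5-bit saturating counters.

```python
class HashedPerceptron:
    def __init__(self, table_sizes, w_min=-16, w_max=15):
        # One weight table per feature; tables may have different depths.
        self.tables = [[0] * n for n in table_sizes]
        self.w_min = w_min
        self.w_max = w_max

    def _indices(self, features):
        # Assumption: a simple modulo stands in for the hardware hash.
        return [f % len(t) for f, t in zip(features, self.tables)]

    def predict_sum(self, features):
        """Look up one weight per feature and add them up."""
        return sum(t[i] for t, i in zip(self.tables, self._indices(features)))

    def predict(self, features, threshold=0):
        return self.predict_sum(features) >= threshold

    def train(self, features, positive):
        """Move every indexed weight by +1 or -1, saturating at the bounds."""
        step = 1 if positive else -1
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(self.w_min, min(self.w_max, t[i] + step))
```

Prediction and training use only table lookups, additions, and comparisons, which is what makes the scheme cheap to implement.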

PPF Architecture
• Baseline prefetcher: SPP
– Modified for high coverage
• Perceptron Weights Tables
– Tables of 5-bit up-down saturating counters
– 1 table per feature
– Variable depth, independent indexing
• Prefetch and Reject Tables
– Record prefetches for future training

PPF Design
• Prefetch suggestions tested using PPF
• Outcome and indexing metadata recorded in Prefetch / Reject Table
• Subsequent feedback of a prior prefetch
• Same perceptron weights re-indexed and updated by +1 / -1
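A minimal sketch of the filter-and-train flow, under several assumptions not stated on the slides: a single accept threshold, dictionaries standing in for the Prefetch and Reject Tables, a demand access to a recorded address counting as positive feedback, and eviction of an unused prefetch counting as negative feedback. The class name `PPFSketch` and its method names are hypothetical.

```python
class PPFSketch:
    W_MIN, W_MAX = -16, 15  # signed 5-bit saturating counters

    def __init__(self, table_sizes=(256, 256), threshold=0):
        self.tables = [[0] * n for n in table_sizes]  # one table per feature
        self.threshold = threshold
        self.prefetch_table = {}  # addr -> features of issued prefetches
        self.reject_table = {}    # addr -> features of rejected suggestions

    def _weights(self, features):
        return [(t, f % len(t)) for t, f in zip(self.tables, features)]

    def _sum(self, features):
        return sum(t[i] for t, i in self._weights(features))

    def _update(self, features, step):
        for t, i in self._weights(features):
            t[i] = max(self.W_MIN, min(self.W_MAX, t[i] + step))

    def filter(self, addr, features):
        """Test a suggestion and record its metadata for later training."""
        accept = self._sum(features) >= self.threshold
        table = self.prefetch_table if accept else self.reject_table
        table[addr] = features
        return accept

    def on_demand(self, addr):
        """A demand to a recorded address: the suggestion was worth issuing."""
        features = self.prefetch_table.pop(addr, None)
        if features is None:
            features = self.reject_table.pop(addr, None)
        if features is not None:
            self._update(features, +1)  # re-index the same weights

    def on_unused_eviction(self, addr):
        """A prefetched line left the cache without being used."""
        features = self.prefetch_table.pop(addr, None)
        if features is not None:
            self._update(features, -1)
```

Because the features are stored alongside the address, the weights that produced a decision can be found again when feedback arrives, even if the program state has since changed.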

Putting Pieces Together
Single Core Configuration
• L1D: Enhanced Next-N-line
• L2C: PPF with SPP
– Triggered on all accesses to L2C
– Can place prefetches in L2C or LLC
• LLC: Next Line prefetcher
– Triggered on demand accesses and only last prefetch from L1D reaching LLC
– Uses the metadata communication path between the prefetchers
Multi Core Configuration
• L1D: No Prefetching
• L2C: PPF with SPP
– Triggered on all accesses to L2C
– Can place prefetches in L2C or LLC
• LLC: SPP (without PPF)
– Separate tables for each core
– Modified to be less aggressive than the original SPP (LLC is a shared resource)
Overhead: 49.94 KB
Overhead: 62.83 KB

Results
Improvement reported over no prefetching
• Single Core: 40.4%
• Multi Core: 20.3%
[Figure: Single Core Speedup. Per-trace IPC bars for SPEC CPU2017 traces (602.gcc_s through 657.xz_s) plus Geo Mean; y-axis from 0.9 to 2.5, with two off-scale bars labeled 2.97 and 3.59.]

Future Work
• Better baseline prefetchers for PPF
• Interaction between the prefetchers
– Metadata communication path between the levels

Thank you!

Backup Slides

L2C Underlying Prefetcher: SPP