ECE 408/CS 483 Fall 2015 Applied Parallel Programming

Lecture 10: Tiled Convolution Analysis
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 408/CS 483/ECE 498AL, University of Illinois, 2007-2012

2D Output Tiling
  • Use a thread block to calculate a tile of P
    – Each output tile is TILE_WIDTH elements wide in both x and y

    row_o = blockIdx.y * TILE_WIDTH + ty;
    col_o = blockIdx.x * TILE_WIDTH + tx;
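The index mapping above can be checked on the host. The following is a minimal C sketch, not the lecture's kernel: `output_index` and `covers_each_output_once` are hypothetical helper names, and TILE_WIDTH = 8 is an assumed value. It mirrors `row_o = blockIdx.y * TILE_WIDTH + ty` in one dimension and confirms that every output index is produced by exactly one (block, thread) pair.

```c
#include <assert.h>

#define TILE_WIDTH 8 /* assumed tile size, for illustration only */

/* Hypothetical helper: output row (or column) handled by thread
   thread_idx of block block_idx, mirroring
   row_o = blockIdx.y * TILE_WIDTH + ty. */
int output_index(int block_idx, int thread_idx)
{
    assert(block_idx >= 0);
    assert(thread_idx >= 0 && thread_idx < TILE_WIDTH);
    return block_idx * TILE_WIDTH + thread_idx;
}

/* Returns 1 if each of the first num_blocks * TILE_WIDTH output
   indices is produced by exactly one (block, thread) pair, else 0. */
int covers_each_output_once(int num_blocks)
{
    int total = num_blocks * TILE_WIDTH;
    for (int target = 0; target < total; ++target) {
        int hits = 0;
        for (int b = 0; b < num_blocks; ++b)
            for (int t = 0; t < TILE_WIDTH; ++t)
                if (output_index(b, t) == target)
                    ++hits;
        if (hits != 1)
            return 0;
    }
    return 1;
}
```

For example, `output_index(2, 3)` is 19: thread 3 of block 2 owns output row 19 when TILE_WIDTH is 8.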

Input tiles need to cover halo elements.
[Figure: an input tile and the corresponding output tile, Mask_Width = 5]
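The tile geometry can be sketched in C as follows. This assumes an odd Mask_Width centered on the output element, which the slides do not state explicitly; `halo_radius`, `input_tile_width`, and `input_tile_start` are hypothetical helper names.

```c
#include <assert.h>

/* Halo elements on each side of the output tile, assuming an odd
   mask width centered on the output element. */
int halo_radius(int mask_width)
{
    assert(mask_width > 0 && mask_width % 2 == 1);
    return mask_width / 2;
}

/* Width of the input tile: the output tile plus a halo on both sides.
   For odd mask widths this equals tile_width + mask_width - 1. */
int input_tile_width(int tile_width, int mask_width)
{
    assert(tile_width > 0);
    return tile_width + 2 * halo_radius(mask_width);
}

/* First input row (or column) needed by block block_idx. The result
   can be negative for block 0, meaning the tile reaches past the
   array edge. */
int input_tile_start(int block_idx, int tile_width, int mask_width)
{
    assert(block_idx >= 0 && tile_width > 0);
    return block_idx * tile_width - halo_radius(mask_width);
}
```

With Tile_Width = 8 and Mask_Width = 5, the halo is 2 elements per side and the input tile is 12 elements wide. How elements outside the array are treated is not covered in these slides.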

A Simple Analysis for a Small 8 × 8 Output Tile Example
  • 12 × 12 = 144 N elements need to be loaded into shared memory (8 + 5 - 1 = 12 per side, with Mask_Width = 5)
  • The calculation of each P element needs to access 25 N elements (a 5 × 5 mask)
  • 8 × 8 × 25 = 1600 global memory accesses are converted into shared memory accesses
  • A reduction of 1600/144 ≈ 11.1×
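The counts above can be reproduced by brute force. The C sketch below is an illustration, not code from the lecture: `count_tile_accesses` is a hypothetical name, and input coordinates are taken relative to the top-left corner of the input tile. It visits every (output element, mask element) pair of one tile, counts all accesses, and counts how many distinct N elements were touched.

```c
#include <assert.h>
#include <string.h>

#define TILE_WIDTH 8
#define MASK_WIDTH 5
#define INPUT_WIDTH (TILE_WIDTH + MASK_WIDTH - 1)

/* *total_accesses: every N access an untiled kernel would make for
   one output tile. *distinct_elements: how many different N elements
   those accesses touch, i.e. the loads the tiled version needs. */
void count_tile_accesses(int *total_accesses, int *distinct_elements)
{
    unsigned char touched[INPUT_WIDTH][INPUT_WIDTH];
    assert(total_accesses != 0 && distinct_elements != 0);
    memset(touched, 0, sizeof touched);
    *total_accesses = 0;
    *distinct_elements = 0;

    for (int row_o = 0; row_o < TILE_WIDTH; ++row_o)
        for (int col_o = 0; col_o < TILE_WIDTH; ++col_o)
            for (int i = 0; i < MASK_WIDTH; ++i)
                for (int j = 0; j < MASK_WIDTH; ++j) {
                    /* Input coordinates relative to the input tile's
                       top-left corner. */
                    int row_i = row_o + i;
                    int col_i = col_o + j;
                    if (!touched[row_i][col_i]) {
                        touched[row_i][col_i] = 1;
                        ++*distinct_elements;
                    }
                    ++*total_accesses;
                }
}
```

Calling it yields 1600 total accesses and 144 distinct elements, matching the bullets above.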

In General
  • (Tile_Width + Mask_Width - 1)^2 elements of N need to be loaded into shared memory
  • The calculation of each element of P needs to access Mask_Width^2 elements of N
  • Tile_Width^2 * Mask_Width^2 global memory accesses are converted into shared memory accesses
  • The reduction is Tile_Width^2 * Mask_Width^2 / (Tile_Width + Mask_Width - 1)^2
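Writing T for Tile_Width and M for Mask_Width (shorthand introduced here, not used in the slides), the reduction ratio can be rearranged to show its upper bound. This is a worked restatement of the formula above, not an additional result from the lecture:

```latex
R(T, M) = \frac{T^2 M^2}{(T + M - 1)^2}
        = M^2 \left( \frac{T}{T + M - 1} \right)^2,
\qquad
\lim_{T \to \infty} R(T, M) = M^2 .
```

Since T / (T + M - 1) < 1 whenever M > 1, the reduction always stays below M^2 (25 for M = 5, 81 for M = 9) and approaches it as the tile grows.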

Bandwidth Reduction for 2D
  • The reduction is Tile_Width^2 * Mask_Width^2 / (Tile_Width + Mask_Width - 1)^2

    Tile_Width                      8       16      32      64
    Reduction (Mask_Width = 5)      11.1    16      19.8    22.1
    Reduction (Mask_Width = 9)      20.3    36      51.8    64
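The table entries follow directly from the formula. A minimal C sketch for recomputing them is shown below; `bandwidth_reduction` and `print_reduction_table` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of global memory accesses without tiling to shared-memory
   loads with tiling, for one 2D output tile. */
double bandwidth_reduction(int tile_width, int mask_width)
{
    assert(tile_width > 0 && mask_width > 0);
    double input_width = tile_width + mask_width - 1;
    double accesses = (double)tile_width * tile_width
                    * mask_width * mask_width;
    return accesses / (input_width * input_width);
}

/* Prints one row of reductions for the tile widths used in the table. */
void print_reduction_table(int mask_width)
{
    static const int tile_widths[] = { 8, 16, 32, 64 };
    printf("Mask_Width = %d:", mask_width);
    for (int k = 0; k < 4; ++k)
        printf(" %.1f", bandwidth_reduction(tile_widths[k], mask_width));
    printf("\n");
}
```

Calling `print_reduction_table(5)` and `print_reduction_table(9)` prints the two rows to one decimal place. Larger tiles and larger masks both raise the ratio, but shared memory capacity limits how large a tile can be in practice.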

ANY MORE QUESTIONS? READ CHAPTER 8