An Energy Efficient Timesharing Pyramid Pipeline for Multiresolution

Slides: 3

An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli
NVIDIA

Applications of Multi-resolution Processing

Background: Linear Pipeline and Segment Pipeline
§ Linear pipeline: Duplicate processing elements (PEs) for each pyramid level; all PEs work together on all pyramid levels in parallel
• + Less demand on off-chip memory bandwidth
• - Inefficient use of the PE resources
• - Area and power overhead
§ Segment pipeline: A recirculating design that uses one processing element to generate all pyramid levels, one level after another
• + Saves computational resources
• - Requires very high memory bandwidth

Our Approach: Time-Sharing Pipeline Architecture
A combined approach:
§ The same PE works on all the pyramid levels in parallel in a time-sharing pipeline manner
§ Each work-cycle, compute -> 1 pixel for G2 (coarsest level) -> 4 pixels for G1 -> 16 pixels for G0 (finest level) -> next cycle, back to G2 and so forth
§ A single PE runs at full speed, as in the segment pipeline
§ Memory traffic is as low as in the linear pipeline

Time-Sharing Pipeline in Gaussian Pyramid and Laplacian Pyramid
Gaussian Pyramid (diagram labels: G0, Single PE, Linebuffer pyramid, Timing MUX)
Block diagram of a convolution-based time-sharing pyramid engine (e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window)
§ The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system
Laplacian Pyramid

Time-sharing Pipeline in Optical Flow Estimation (L-K)
§ Three time-sharing pipelines work simultaneously:
• Two for Gaussian pyramid construction (fine to coarse scale)
• One for motion estimation (coarse to fine scale)
§ Only needs to read the two source images from the main memory and write the resulting motion vectors back to the memory
Simulation Result
§ Optical flow (velocity) on a benchmark image with a left-to-right movement
§ The proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach

Line Buffer, Sliding Window Registers and Blocklinear
§ Sliding window operations
§ Pixels stream into the on-chip line buffer for temporary storage
§ Line buffer size is proportional to the image width, making the line buffer cost for high-resolution images huge
Blocklinear Image Processing
§ Inspired by the GPU block-linear texture memory layout
§ Significantly reduces the linebuffer size
§ Linebuffer width is equal to the block width
§ Data refetch at boundary

Hardware Synthesis in 32 nm CMOS
§ A Genesis-based chip generator encapsulates all the parameters (e.g., window size, pyramid levels) and allows the automated generation of synthesizable HDL hardware for design space exploration
§ Hardware chip generator GUI
§ Design points are running at 500 MHz on 32 nm CMOS

Area Evaluation
Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP)
§ TP consumes much less PE area due to the time-sharing of the same PE among different pyramid levels
§ The cost of the extra shift registers and controlling logic for time-sharing configurations is negligible compared with the reduction of the PE cost
Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP)
§ TP consumes increasingly more area compared to SP as the pyramid levels grow
§ The overhead of TP over SP is fairly small for designs with small windows

Memory Bandwidth Evaluation
§ DRAM traffic is an order of magnitude less than SP (energy saving)
§ TP only accesses the source images from the DRAM and returns the resulting motion vectors back to the DRAM
§ All other intermediate memory traffic is completely eliminated

BlockLinear Design Evaluation
P(N) = Parallel Degree. B(N) = Number of Blocks.
§ Increasing the number of blocks reduces the linebuffer area while keeping the same throughput
§ This chart demonstrates various design trade-offs

Overall Performance & Energy Evaluation
§ TP is almost 2x faster than SP
§ TP is only slightly slower than LP while eliminating all the logic costs
§ Energy consumption is dominated by DRAM accesses
§ vs. SP: 10x saving on DRAM access (log scale), similar on-chip memory access and logic processing cost
§ vs. LP: Similar DRAM access cost, but less energy cost for the on-chip logic processing

http://www.c2s2.org
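The block-linear trade-off described on this slide (linebuffer width equal to the block width, at the price of data refetch at block boundaries) can be illustrated with a rough sizing sketch. The function names, the choice of window_height - 1 buffered rows, and the refetch count of window_width - 1 columns per internal boundary are assumptions made for illustration; they are not taken from the poster.

```python
def linebuffer_pixels(image_width, window_height, num_blocks=1):
    """Pixels of on-chip line buffer for a sliding window.

    Assumption: the buffer keeps (window_height - 1) rows, and each
    buffered row is one block wide (the full image width when
    num_blocks == 1).
    """
    if num_blocks < 1 or image_width % num_blocks != 0:
        raise ValueError("image_width must be a positive multiple of num_blocks")
    block_width = image_width // num_blocks
    return (window_height - 1) * block_width


def refetched_columns(window_width, num_blocks):
    """Extra pixel columns fetched again at internal block boundaries.

    Assumption: neighbouring blocks overlap by (window_width - 1)
    columns so that windows straddling a boundary see all their inputs.
    """
    return (num_blocks - 1) * (window_width - 1)


if __name__ == "__main__":
    # Hypothetical 1920-pixel-wide image with a 3x3 window.
    for blocks in (1, 2, 4, 8):
        print(blocks, linebuffer_pixels(1920, 3, blocks), refetched_columns(3, blocks))
```

Under these assumptions the buffer shrinks in proportion to the number of blocks while the refetch overhead grows only with the number of boundaries, which is the kind of trade-off the design-evaluation chart is said to show.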

An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli
NVIDIA

Applications in Multi-resolution Processing
§ Panorama Stitching
§ HDR
§ Detail Enhancement
§ Optical Flow

Existing Solutions: Linear Pipeline and Segment Pipeline
§ Linear pipeline: Duplicate processing elements (PEs) for each pyramid level; all PEs work together on all pyramid levels in parallel
• Pro: Less demand on off-chip memory bandwidth
• Con:
o Inefficient use of the PE resources
o Area and power overhead
§ Segment pipeline: A recirculating design that uses one processing element to generate all pyramid levels, one level after another
• Pro: Saves computational resources
• Con: Requires very high memory bandwidth

Proposed Solution: Time-sharing Pipeline
§ The same PE works on all the pyramid levels in parallel in a time-sharing pipeline manner
§ Each work-cycle, compute -> 1 pixel for G2 (coarsest level) -> 4 pixels for G1 -> 16 pixels for G0 (finest level) -> next cycle, back to G2 and so forth
§ A single PE runs at full speed, as in the segment pipeline
§ Memory traffic is as low as in the linear pipeline

Application Demonstration
Gaussian Pyramid (diagram labels: G0, Single PE, Linebuffer pyramid, Timing MUX)
§ The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system
Laplacian Pyramid
Hierarchical Lucas-Kanade
§ Three time-sharing pipelines work simultaneously:
• Two for Gaussian pyramid construction (fine to coarse scale)
• One for motion estimation (coarse to fine scale)
§ Only needs to read the two source images from the main memory and write the resulting motion vectors back to the memory
Simulation Result of Hierarchical Lucas-Kanade

Evaluation
Area
Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP)
§ TP consumes much less PE area due to the time-sharing of the same PE among different pyramid levels
§ The cost of the extra shift registers and controlling logic for time-sharing configurations is negligible compared with the reduction of the PE cost
Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP)
§ TP consumes increasingly more area compared to SP as the pyramid levels grow
§ The overhead of TP over SP is fairly small for designs with small windows
Bandwidth
§ DRAM traffic is an order of magnitude less than SP (energy saving)
§ TP only accesses the source images from the DRAM and returns the resulting motion vectors back to the DRAM
§ All other intermediate memory traffic is completely eliminated
Power
§ TP is almost 2x faster than SP
§ TP is only slightly slower than LP while eliminating all the logic costs
§ Energy consumption is dominated by DRAM accesses
§ vs. SP: 10x saving on DRAM access (log scale), similar on-chip memory access and logic processing cost
§ vs. LP: Similar DRAM access cost, but less energy cost for the on-chip logic processing

http://www.c2s2.org
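The work-cycle order stated on this slide (1 pixel for G2, then 4 for G1, then 16 for G0, then back to G2) can be written out as a small schedule generator. This is a sketch only: the names `work_cycle` and `pe_slots` are hypothetical, and the per-level factor of 4 assumes each pyramid level halves the image in both dimensions.

```python
def work_cycle(levels=3, factor=4):
    """Return [(level_name, pixels)] for one work-cycle, coarsest level first.

    Assumption: level k holds 1/factor as many pixels as level k - 1,
    so the finest level G0 receives factor ** (levels - 1) pixels per cycle.
    """
    return [(f"G{lvl}", factor ** (levels - 1 - lvl))
            for lvl in range(levels - 1, -1, -1)]


def pe_slots(levels=3, factor=4, cycles=1):
    """Yield the level served by the single shared PE in each pixel slot."""
    for _ in range(cycles):
        for name, pixels in work_cycle(levels, factor):
            for _ in range(pixels):
                yield name


if __name__ == "__main__":
    print(work_cycle())                 # per-level pixel counts for one work-cycle
    print(" ".join(pe_slots(cycles=1)))  # slot-by-slot level sequence
```

With three levels, one work-cycle occupies 1 + 4 + 16 = 21 PE slots, after which the sequence restarts at G2.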

An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli
NVIDIA

Applications in Multi-resolution Processing
§ Panorama Stitching
§ HDR
§ Detail Enhancement
§ Optical Flow

Existing Solutions: Linear Pipeline and Segment Pipeline
§ Linear pipeline: Replicate processing elements (PEs) for each pyramid level; all PEs work in parallel for all pyramid levels
• Pro: Less demand of off-chip memory bandwidth
• Con:
o Inefficient use of the PE resources
o Area and power overhead
§ Segment pipeline: A recirculating design, uses one processing element to generate all pyramid levels, one level after another
• Pro: Saves computational resources
• Con: Requires very high memory bandwidth

Proposed Solution: Time-sharing Pipeline
§ The same PE works for all the pyramid levels in parallel as a time-sharing pipeline
§ Each work-cycle, compute -> 1 pixel for G2 (coarsest level) -> 4 pixels for G1 -> 16 pixels for G0 (finest level) -> next cycle, back to G2 and so forth
§ One PE runs at full speed as a segment pipeline
§ As low memory traffic as a linear pipeline

Application Demonstration
Gaussian Pyramid (diagram labels: G0, Single PE, Linebuffer pyramid, Timing MUX)
§ The convolution engine can be replaced by other processing elements for a more complicated multiresolution pyramid system
Laplacian Pyramid
Hierarchical Lucas-Kanade
§ Three time sharing pipelines work simultaneously:
o Two for Gaussian pyramids construction (from fine to coarse scale)
o One for motion estimation (from coarse to fine scale)
§ Only needs to read the two source images from the main memory, and write the resulting motion vectors back to the memory
Simulation Result of Hierarchical Lucas-Kanade Optical Flow

Evaluation
Area
Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP)
§ TP consumes much less PE area
§ The cost of extra shift registers and controlling logic is negligible compared to the reduction of the PE cost
Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP)
§ TP consumes more area as the pyramid levels grow
§ The area cost is still competitive in small window
Bandwidth
§ DRAM traffic is an order of magnitude less than SP (energy saving)
§ TP only accesses the source images from the DRAM, and returns the motion vectors back to the DRAM
§ All other intermediate memory traffic is completely eliminated
Power
§ TP is almost 2x faster than SP
§ TP is only slightly slower than LP while eliminating all the logic costs
§ Energy consumption is dominated by DRAM accesses
§ vs. SP: 10x saving on DRAM access (log scale), similar on chip memory accessing and logic processing cost
§ vs. LP: Similar DRAM access cost, but less energy cost on the on-chip logic processing

http://www.c2s2.org
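The claims that a linear pipeline uses its PEs inefficiently and that TP is only slightly slower than LP can be made concrete with an idealized pixel-count model. It is a back-of-the-envelope sketch, not the poster's evaluation method: it assumes one pixel per PE per clock, a 4x pixel reduction per level, and no pipeline fill or memory stalls, and the function names are hypothetical.

```python
from fractions import Fraction


def level_loads(levels, factor=4):
    """Pixel load of each level relative to the finest level G0.

    Assumption: every level has 1/factor as many pixels as the one below it.
    """
    return [Fraction(1, factor ** lvl) for lvl in range(levels)]


def lp_pe_utilization(levels, factor=4):
    """Average busy fraction of the PEs in a linear pipeline.

    Assumption: one PE per level, all PEs clocked at the rate needed by
    G0, so the PE for level k is busy only 1/factor**k of the time.
    """
    return sum(level_loads(levels, factor)) / levels


def tp_time_vs_lp(levels, factor=4):
    """Run time of one shared PE relative to the linear pipeline.

    Assumption: the shared PE processes the pixels of every level
    back to back, while LP finishes in the time needed for G0 alone.
    """
    return sum(level_loads(levels, factor))


if __name__ == "__main__":
    print(lp_pe_utilization(3))  # Fraction: average LP PE utilization, 3 levels
    print(tp_time_vs_lp(3))      # Fraction: TP run time relative to LP
```

For three levels this model gives an average LP utilization of 7/16 and a TP run time of 21/16 of the LP run time. These figures come from the idealized assumptions above and are not the measured results reported on the poster.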