Notes on an actor language, Jörn W. Janneck

Notes on an actor language
Jörn W. Janneck, Xilinx Inc.
13 February 2007, 7th Ptolemy Miniconference

CAL Actor Language
• scripting actor specifications
  – make it easier to write atomic actors
• experimenting with domain polymorphism
• (code generation)

Outline
CAL @ Ptolemy
• the language
• domain-dependent interpretation
CAL @ Xilinx
• overview
• application

actors in CAL
• Actions: guarded atomic actions
• State: encapsulated state

simple actors

  actor Sum () Input ==> Output:
    sum := 0;

    action [a] ==> [sum]
    do
      sum := sum + a;
    end
  end

  actor SumAbs () Input ==> Output:
    sum := 0;

    action [a] ==> [sum]
    guard a >= 0
    do
      sum := sum + a;
    end

    action [a] ==> [sum]
    guard a < 0
    do
      sum := sum - a;
    end
  end

[Figure: Sum and SumAbs drawn as blocks, each with one Input port and one Output port]
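The guarded-action behavior of the second actor can be imitated in ordinary Python. This is only a sketch under stated assumptions, not CAL tooling: the class `SumAbs`, its `fire` method, the helper `run_sum_abs`, and the use of a deque as the input queue are all hypothetical names chosen for illustration, and the sketch assumes the output expression `[sum]` is evaluated after the action body has run.

```python
from collections import deque


class SumAbs:
    """Hypothetical Python stand-in for the CAL actor with two guarded actions."""

    def __init__(self):
        self.sum = 0  # encapsulated state, initialised like `sum := 0`

    def fire(self, inp, out):
        """Try to fire one action; return True if an action fired."""
        if not inp:
            return False      # no input token, so no action is enabled
        a = inp[0]            # peek at the token so the guards can be evaluated
        if a >= 0:            # first action: guard a >= 0
            self.sum += a
        else:                 # second action: guard a < 0
            self.sum -= a
        inp.popleft()         # consume the input token
        out.append(self.sum)  # produce the output token [sum]
        return True


def run_sum_abs(tokens):
    """Feed all tokens to a fresh actor and collect its output tokens."""
    inp, out = deque(tokens), []
    actor = SumAbs()
    while actor.fire(inp, out):
        pass
    return out
```

For example, `run_sum_abs([1, -2, 3])` accumulates absolute values, because exactly one of the two guards holds for every token.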

nondeterminism

  actor NDMerge () Input1, Input2 ==> Output:
    action Input1: [x] ==> [x] end
    action Input2: [x] ==> [x] end
  end

[Figure: NDMerge drawn as a block with input ports Input1 and Input2 and output port Output]
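The nondeterminism arises because both actions can be enabled at once and the actor description does not say which one fires. A minimal Python sketch of that choice follows; the function `nd_merge` and its `rng` parameter are hypothetical, and picking uniformly at random among enabled actions is just one possible resolution, not something the slides prescribe.

```python
import random
from collections import deque


def nd_merge(input1, input2, rng=None):
    """Merge two token sequences, choosing freely among enabled actions."""
    rng = rng or random.Random()
    q1, q2 = deque(input1), deque(input2)
    out = []
    while q1 or q2:
        # An action is enabled when its input queue holds at least one token.
        enabled = [q for q in (q1, q2) if q]
        chosen = rng.choice(enabled)   # assumption: uniform random choice
        out.append(chosen.popleft())   # copy the token to the output
    return out
```

Whatever choices are made, the result is an interleaving: the relative order of tokens from each input is preserved, but the order between the two inputs is not determined.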

data-dependent token flow

  actor Select () S, A, B ==> Output:
    action S: [sel], A: [v] ==> [v]
    guard sel
    end

    action S: [sel], B: [v] ==> [v]
    guard not sel
    end
  end

[Figure: Select drawn as a block with input ports S, A, and B and output port Output]
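Here the control token on S decides which data port is read, so the number of tokens consumed on A and B depends on the data. The following Python sketch mimics one firing; `select_fire` and the deque-based queues are hypothetical, and the sketch assumes a guard may inspect the control token before anything is consumed.

```python
from collections import deque


def select_fire(s, a, b, out):
    """Fire Select once if an action is enabled; return True on success."""
    if not s:
        return False            # no control token available
    sel = s[0]                  # peek: guards look at the token without consuming it
    data = a if sel else b      # guard sel reads port A, guard not sel reads port B
    if not data:
        return False            # the selected data port has no token yet
    s.popleft()                 # consume the control token
    out.append(data.popleft())  # forward the selected data token
    return True
```

With control tokens `True, False, False`, one token is taken from A and then two from B; a `True` control token with an empty A queue leaves the actor unable to fire even if B has data.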

CAL and domain polymorphism
two fundamental questions:
1. Can an actor be interpreted/used in a given MoC?
2. What is its interpretation?
⇒ domain-specific interpretation

Example: SDF

  actor Add () Input1, Input2 ==> Output:
    action [a], [b] ==> [a + b] end
  end

  actor AddSeq () Input ==> Output:
    action [a, b] ==> [a + b] end
  end

[Figure: Add with token rates Input1: 1, Input2: 1, Output: 1; AddSeq with token rates Input: 2, Output: 1]

Example: SDF (cont’d)

  actor NDMerge () Input1, Input2 ==> Output:
    action Input1: [x] ==> [x] end
    action Input2: [x] ==> [x] end
  end

  actor Merge () Input1, Input2 ==> Output:
    action [x1], [x2] ==> [x1, x2] end
  end

[Figure: NDMerge with ports Input1, Input2, Output; Merge with token rates Input1: 1, Input2: 1, Output: 2]
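The first of the two questions (can this actor be used in SDF?) can be approximated by comparing the token rates of the actions. The Python sketch below uses one simple sufficient condition: every action is unguarded and all actions consume and produce the same number of tokens on the same ports. That criterion, the function `sdf_rates`, and the dictionary encoding of actions are assumptions made for illustration; the slides do not state the rule actually used in Ptolemy.

```python
def sdf_rates(actions):
    """Return (consumption, production) if all actions share fixed rates
    and have no guard; otherwise return None."""
    rates = None
    for act in actions:
        if act.get("guard"):
            return None                      # guarded actions are rejected here
        current = (act["consumes"], act["produces"])
        if rates is None:
            rates = current
        elif rates != current:
            return None                      # actions disagree on token rates
    return rates


# Hypothetical encodings of the actors shown on the SDF slides.
ADD = [{"consumes": {"Input1": 1, "Input2": 1}, "produces": {"Output": 1}}]
ADD_SEQ = [{"consumes": {"Input": 2}, "produces": {"Output": 1}}]
MERGE = [{"consumes": {"Input1": 1, "Input2": 1}, "produces": {"Output": 2}}]
ND_MERGE = [
    {"consumes": {"Input1": 1}, "produces": {"Output": 1}},
    {"consumes": {"Input2": 1}, "produces": {"Output": 1}},
]
```

Under this check Add, AddSeq, and Merge yield fixed rates, while NDMerge does not, because its two actions read from different ports.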

Some kind of “synchronous”...
[Figure: dataflow network containing the actors F, NDMerge, and A, with token rates (1 and 2) annotated on the connections]

Example: CSP

  actor NDMerge () Input1, Input2 ==> Output:
    action Input1: [x] ==> [x] end
    action Input2: [x] ==> [x] end
  end

  CSP interpretation:
    [ Input1 ? x -> Output ! x
   || Input2 ? x -> Output ! x ]

  actor Add () Input1, Input2 ==> Output:
    action [a], [b] ==> [a + b] end
  end

  CSP interpretation, reading the inputs in a fixed order:
    Input1 ? a -> Input2 ? b -> Output ! a + b

  CSP interpretation, accepting the inputs in either order:
    [ Input1 ? a -> Input2 ? b
   || Input2 ? b -> Input1 ? a ] ; Output ! a + b

Example: CSP (cont’d)

  actor Select () S, A, B ==> Output:
    action S: [sel], A: [v] ==> [v]
    guard sel
    end

    action S: [sel], B: [v] ==> [v]
    guard not sel
    end
  end

  CSP interpretation:
    S ? sel;
    [ sel -> A ? v -> Output ! v
   || not sel -> B ? v -> Output ! v ]

  actor A () X, Y ==> Z:
    action X: [x1, x2] ==> [f(x1, x2)]
    guard P(x1, x2)
    end

    action Y: [y1, y2] ==> [f(y1, y2)]
    guard P(y1, y2)
    end
  end

  CSP interpretation: ?

CAL and dataflow at Xilinx
actor source + network
• simulation
• software: class MyActor { schedule(); readPort( portNum ); writePort( portNum ); }
• hardware: high-level synthesis
new FPGA programming model & tools
• hardware code generation
• software (& mixed) code generation
driver application
• MPEG4 Simple Profile Decoder
MPEG standardization effort
• ISO/IEC 23001-4 (working draft): Codec Configuration Representation
• ISO/IEC 23002-4 (working draft): Video Tool Library

FPGA Programming In Practice: Networked MPEG-4 Viewer
[Figure: system block diagram]
• Remote Video Stream Server, connected by UDP over Ethernet
• XUP Board (2VP30) containing: Microblaze running LWIP UDP protocol stack, Ethernet, Decoder Actor Network, Raster Scan Actor, VGA Display, IP blocks, Memory Controller
• Local VGA Monitor

MPEG-4 SP Decoder: quality of compiled code

                               Area
  Version                      Slice   LUT    FF     BRAM     MULT
  VHDL IP [1] (15000 lines)    4637    7923   2637   26 [2]   34
  CAL decoder (4000 lines)     3872    7720   3576   22 [3]   7

  Performance
  VHDL IP:      180 K macroblock/s @ 100 MHz; requires ZBT SRAM framebuf
  CAL decoder:  HD image size; 243 K macroblock/s @ 120 MHz; interfaces to DRAM framebuf; I-frame parsing: 50 Mbit/s

  [1] http://www.xilinx.com/bvdocs/ipcenter/data_sheet/ds520_prod_brf.pdf
  [2] BRAM-limited to 4-CIF image size.
  [3] Supports HD image size. Reduces to 16 BRAMs for 4-CIF image size.

comparing decoder solutions
[Figure: relative area efficiency plotted against throughput (macroblocks/sec x 1000), with CIF, SD, and HD marked on the throughput axis and data points a to d]
a  TI 64xx MPEG-4 (CPU + L1 cache only)
b  ISSCC’06 H.264 capable (includes periphery)
c  FPGA MPEG-4 using traditional HDL flow (12 MM effort)
d  FPGA MPEG-4 using actor/dataflow synthesis (3 MM effort)

Thank You.
Credits: Dave B. Parlour, Ian D. Miller, Johan Eker, Edward A. Lee, and many others.
CAL actor language: embedded.eecs.berkeley.edu/caltrop

BACKUP

programming language adoption

  Name         TPCI      TPCI cum.
  C            17.66%    17.66%
  C++          11.06%    28.73%
  Perl          5.48%    34.20%
  Python        3.47%    37.67%
  VB            9.73%    47.40%
  Delphi        2.15%    49.54%
  Java         21.17%    70.72%
  PHP           9.86%    80.58%
  JavaScript    2.20%    82.78%
  C#            3.07%    85.85%

  Year: 1973, 1985, 1987, 1990, 1991, 1994, 1995, 2002

[Figure: cumulative TPCI (0 to 100) by language creation date, 1970 to 2005, for the top 10 languages]
source: TIOBE Programming Community Index, TPCI, October 2006, http://www.tiobe.com/tpci.htm

Smaller, Faster, Easier: Too good to be true?
• This is what happens when design effort is constrained.
• The key is enabling architectural exploration with rapid turn-around time.
• New decoder architecture incorporates many improvements over original design in motion compensation, AC/DC reconstruction, parser, 2-d IDCT.
• Approximate manpower numbers:
  – VHDL decoder: 12 months
  – Dataflow decoder: 3 months

Architectural Exploration: MPEG4 Motion Compensator
[Figure: motion compensator with video stream input, feedback path, and video frame buffer (off-chip DRAM)]
PROBLEM! Memory latency for random access reads and writes prevents real-world operation at HD rates.

First Step: Try on-chip cache
• Break the address and data streams, insert a cache placeholder.
• Insert different policies, see what happens.
  – policy 1: Pass-through just to make sure model is OK.
  – policy 2: Insert a cache actor in the read path and monitor statistics.

Simulation result with policy 2

Monitor console:
  Frame 1 OK time: 28111 ms
  Frame 2 OK time: 23834 ms
  Requests: 49456, Hits: 45360, Miss rate: 8.28%
  Frame 3 OK time: 27369 ms
  Requests: 98704, Hits: 90512, Miss rate: 8.30%

• Memory controller performance: 133 MHz clock, 32 pixel cache line fill in ~18 cycles
• Worst case compensation is 81 reads for an 8x8 block.
• 8.3% miss rate implies average read is ~2.4 cycles
• Rate limit is 44 Mpixel/s
• Options for improvement
  - more expensive controller
  - much better cache policy
  - application-aware prefetch
HD (1920p, 4:2:0, 30 fps) rate target is 93.3 Mpixel/s
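The numbers on this slide can be reproduced with a short calculation. The slide does not spell out its derivation, so the Python sketch below rests on assumptions: a cache hit costs 1 cycle, a miss costs the full ~18-cycle line fill, the worst case of 81 reads delivers one 64-pixel (8x8) block, the HD frame is 1920x1080, and 4:2:0 sampling means 1.5 samples per pixel. All function names are hypothetical.

```python
def miss_rate(requests, hits):
    """Fraction of cache requests that missed."""
    return (requests - hits) / requests


def average_read_cycles(miss, hit_cycles=1.0, miss_cycles=18.0):
    """Expected cycles per read (assumed: 1-cycle hit, 18-cycle miss)."""
    return (1.0 - miss) * hit_cycles + miss * miss_cycles


def pixel_rate(clock_hz, cycles_per_read, reads_per_block=81, pixels_per_block=64):
    """Worst-case pixels per second when 81 reads yield one 8x8 block."""
    reads_per_second = clock_hz / cycles_per_read
    return reads_per_second * pixels_per_block / reads_per_block


def hd_target(width=1920, height=1080, fps=30, samples_per_pixel=1.5):
    """Required sample rate for 4:2:0 video (assumed 1920x1080 frames)."""
    return width * height * samples_per_pixel * fps


miss = miss_rate(98704, 90512)          # figures from the frame 3 console line
avg = average_read_cycles(miss)         # roughly 2.4 cycles
limit = pixel_rate(133e6, avg)          # roughly 44 Mpixel/s
target = hd_target()                    # roughly 93.3 Mpixel/s
```

Under these assumptions the cache-based design reaches a little under half of the HD target, which is consistent with the slide's conclusion that a better strategy is needed.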

Step 2: Application-aware prefetch
[Figure: revised motion compensator model with these annotations]
• requests to frame buffer
• prefetch data
• replace cache with “search window”
• compensation addresses now relative to search window
• senses block type

Results of prefetch strategy
• Better performance
  – prefetch needs to operate at 3x pixel rate
  – exploits longer burst read with application-awareness (longer cache line did not help policy 2 significantly)
  – 64 pixels in 26 cycles → average read is ~0.4 cycles
  – peak theoretical performance is 111 Mpixel/s
  – exceeds HD rate target with cheap DRAM
• Substantial change to overall model behavior, but
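One reading that reproduces the figures above is sketched below in Python. It is an assumption, not the author's stated derivation: the average read cost is taken as burst cycles divided by burst pixels, and the peak rate as the 133 MHz memory clock divided by the rounded 0.4 cycles per read and by the 3x overfetch factor. The function names and the 93.3 Mpixel/s constant (the HD target quoted earlier in the deck) are used for illustration only.

```python
def prefetch_read_cycles(burst_cycles=26, burst_pixels=64):
    """Average cycles per pixel read for one prefetch burst."""
    return burst_cycles / burst_pixels


def peak_pixel_rate(clock_hz=133e6, cycles_per_read=0.4, overfetch=3):
    """Peak output pixel rate when prefetch must run at `overfetch` x pixel rate.

    Assumption: the slide's rounded value of 0.4 cycles per read is used.
    """
    return clock_hz / (cycles_per_read * overfetch)


HD_TARGET = 93.3e6                      # Mpixel/s target quoted in the deck

cycles = prefetch_read_cycles()         # 26 / 64, about 0.4
peak = peak_pixel_rate()                # about 111 Mpixel/s
meets_hd = peak > HD_TARGET
```

With the unrounded 26/64 cycles per read the same formula gives about 109 Mpixel/s, which still clears the HD target.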

The FPGA programming problem
• Big, heterogeneous chips
• circuit-design programming (+ C, Simulink, ...)
1985: 128 4-LUTs
2006: [V5-LX] 207360 6-LUTs, 10 Mbit BRAM, 192 ALUs