A Fast General-Purpose Clustering Algorithm Based on FPGAs

A Fast General-Purpose Clustering Algorithm Based on FPGAs for High-Throughput Data Processing
A. Annovi, M. Beretta, P. Laurelli, G. Maccarrone, A. Sansoni
INFN - Frascati
FRONTIER DETECTORS FOR FRONTIER PHYSICS, 11th Pisa meeting on advanced detectors, 24-30 May 2009
Pisa Meeting, May 27th, 2009. Alberto Annovi - INFN Frascati

Why 2D pixel clustering?
• Several applications for pixel detectors in general
– Particle tracking (including trigger)
– Medical imaging
– Astrophysics image acquisition, etc. (SDSS)
• All can benefit from high-throughput clustering
• The clustering algorithm has several functions
– Improves resolution (e.g. spatial or other parameters)
– Reduces the amount of data (N hits --> 1 cluster)
– Identifies objects (e.g. for medical imaging)
– Performs advanced cluster shape analysis [A. Retico et al., Comput Biol Med. 2008 Apr; 38(4): 525-34]

Imagine a self-clustering detector
ATLAS pixel layer 2: thousands of fibers sending raw data.
Imagine a device that clusters data on the fly:
1. Connected to the remote fiber end (off detector)
2. Or directly on the pixel front-end electronics
[2008 JINST 3 P07007] Front-end with partial clustering
ATLAS Insertable B-Layer (see H. Pernegger, Friday)

A real application: clustering for the ATLAS FastTracKer
• Pixel clustering device for the ATLAS FastTracKer processor
– 1st application & design motivation
– http://twiki.cern.ch/twiki/bin/view/Atlas/FastTracker
• Main challenge: input rate 160 Gbit/s
– 132 S-link fibers from all pixel RODs
• Running at 1.2 Gbit/s each (total ~160 Gbit/s)
• 32-bit words at 40 MHz, 1 hit/word
– Use hits at 40 MHz as benchmark
• Focus on clustering quality for level-2
• Illustrate a general clustering strategy
[Diagram labels: pixel detector interface; 132 S-links; 50-100 kHz event rate; clustering device (FastTracKer input stage); FTK reconstructs tracks for level 2; level-2; event buffers]

The problem
• Clustering is a 2D problem
1. Associate hits from the same cluster
– Loop over the hit list
– Time increases with occupancy & instantaneous luminosity
– Non-linear execution time
2. Calculate cluster properties
– e.g. center, size, shape …
• Goal: execution time linear in the number of hits
– Not a limiting factor even at the highest instantaneous luminosity
[Figure: numbered hits scattered over the pixel matrix; loop over the list of hits]

The algorithm working principle
1st phase: the pixel module is a 328 x 144 matrix. Replicate it in a hardware matrix (FPGA replica of the pixel matrix). The matrix identifies hits in the same cluster (local connections).
2nd phase: hits in a cluster are analyzed (averaged). Flexibility to choose the algorithm!
Processing flow:
Loop over events and pixel modules
  Load all module hits
  Loop over clusters in a module
    Select left-most top-most hit
    Propagate selection through cluster
    Read out cluster
Core logic: hits associated into clusters
2nd pipeline stage: high-level cluster analysis (average calculator) --> out
[Figure: FPGA replica of pixel matrix, eta direction -->]
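The processing flow above can be sketched in software. This is only an illustrative model, not the firmware: the function name `find_clusters` and the (row, col) tuple representation are assumptions, contiguity is taken as side or corner (the Fast Tracker definition quoted in the conclusions), and a sequential flood fill stands in for the parallel propagation done by the hardware matrix.

```python
def find_clusters(hits):
    """Group (row, col) hits into clusters of side- or corner-contiguous pixels."""
    remaining = set(hits)  # step 1: load all module hits
    clusters = []
    while remaining:
        # step 2: select the left-most hit, breaking ties with the top-most row
        seed = min(remaining, key=lambda rc: (rc[1], rc[0]))
        remaining.remove(seed)
        selected = [seed]
        frontier = [seed]
        # step 3: propagate "selected" to every contiguous hit
        while frontier:
            r, c = frontier.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    neighbor = (r + dr, c + dc)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        selected.append(neighbor)
                        frontier.append(neighbor)
        # step 4: read out the cluster
        clusters.append(sorted(selected))
    return clusters
```

Each hit is visited once as a seed or neighbor, so the work grows linearly with the number of hits, which is the stated goal.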

Core logic, step 1: load hits
Logic functions
1. Load hits
2. Select left-top-most hit
3. Propagate “selected”
4. Read out cluster
Load hits regardless of readout order. Any readout order is allowed.
[Figure: pixel matrix with row index and column index axes]

Core logic, step 2: find the left-most top-most hit
Priority logic and control logic
MANY levels of logic (328+144): this will determine the algorithm (clock) speed. (See the slide "Resources and clock speed".)
[Figure: pixel matrix with row index and column index axes]

Core logic, step 3: propagate “selected” (local logic)
[Figure: pixel matrix with row index and column index axes; legend: black pixel]

Core logic, step 4: read out cluster
• (3) Select propagation runs in parallel with (4) cluster readout
Control logic and priority logic
[Figure: pixel matrix with row index and column index axes; legend: black pixels]

Select & readout in parallel
(3) Select propagation in parallel with (4) cluster readout
Works for any cluster shape
[Figure: time sequence from the 0th to the 4th clock cycle showing selection and readout; legend: HIT pixel, SELECTED pixel, READOUT pixel]

Two priority logic chains
1st priority logic: needed to select the first hit
2nd priority logic: needed to read out the selected hits (cluster); position from the address bus
This logic selects the top-most pixel. Similar logic selects the left-most column with a hit.
328 pixels in a column and 144 columns
[Figure: chain of pixel cells with hit, select (sel) and control logic signals]
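A software model of one priority chain along a column may help. This is a sketch under assumptions: the slides do not give the gate-level design, so a simple ripple chain is used here, and the name `priority_select` is hypothetical. Each cell is selected only if it is hit and no cell above it is hit.

```python
def priority_select(hits):
    """Return a one-hot list marking the first (top-most) hit in a column.

    hits: list of booleans ordered from the top of the column to the bottom.
    """
    select = []
    blocked = False  # True once a hit has been seen higher up in the chain
    for is_hit in hits:
        select.append(is_hit and not blocked)
        blocked = blocked or is_hit
    return select
```

In this model the `blocked` signal ripples through every cell, which illustrates why the number of logic levels grows with the matrix dimensions.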

The elementary cell
3 STATES (2 FLIP-FLOPS): IS_EMPTY, IS_HIT, IS_SELECTED
Cluster definition: contiguous hits along side or corner
Flexible “cluster definition”: flexibility to redefine it (1st neighborhood, or 1st & 2nd neighborhood)
[Figure: cell schematic with inputs HIT, SEL, clk, WRITE (ROW SEL AND COLUMN SEL) and IS_SELECTED from the neighbors (labels 8 and 24; 1st neighborhood and 1st & 2nd neighborhood); combinatorial logic blocks driving IS_HIT and IS_SELECTED; outputs SEL FOR READOUT and row address bus (label 9); label: condensed pulse height]

Clustering resolution w.r.t. offline
2nd pipeline stage: high-level cluster analysis (average calculator)
Including collected charge in the average:
– RMS rφ ~ 0.002 pixels (~0.1 μm)
– RMS z ~ 0.05 pixels (~20 μm)
Offline accounts for track angle.
Without collected charge (digital):
– RMS rφ ~ 0.07 pixels (~4 μm)
– RMS z ~ 0.12 pixels (~50 μm)
Clusters are found close to offline: RMS ~0.1 pixels
Pixels are 50 x 400 μm
[Figure: rφ, z residuals w.r.t. offline for MC single muons; distances in units of pixels; note log Z scale]
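The two averaging options compared above can be sketched as follows. The exact formula used by the average calculator is not given on the slides; this sketch assumes a plain mean for the digital case and a mean weighted by time over threshold (ToT) for the charge case. The function names and the (row, col, tot) tuple layout are illustrative assumptions.

```python
def digital_center(cluster):
    """Unweighted mean position of a cluster of (row, col, tot) hits."""
    n = len(cluster)
    row = sum(r for r, _, _ in cluster) / n
    col = sum(c for _, c, _ in cluster) / n
    return row, col


def weighted_center(cluster):
    """ToT-weighted mean position of a cluster of (row, col, tot) hits."""
    total = sum(tot for _, _, tot in cluster)
    row = sum(r * tot for r, _, tot in cluster) / total
    col = sum(c * tot for _, c, tot in cluster) / total
    return row, col
```

For a two-pixel cluster `[(10, 20, 1), (10, 21, 3)]` the digital center is at column 20.5, while the weighted center moves to 20.75, toward the pixel with the larger ToT.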

Implemented & simulated
• Using Xilinx Virtex-5 (xc5vlx330)
• Timing (cycles): 2/hit + 2/cluster
– could be reduced to 2/hit

Resources and clock speed
xc5vlx330: FPGA usage and clock period increase for large matrices.
For a 328 x 144 matrix, area usage is ~250%.
Now what? Take advantage of the readout order (depends on the actual detector).

ATLAS Pixel module
328 x 144 pixels read out by 16 FE chips
50 μm / pixel (rφ coordinate), 400 μm / pixel (z coordinate)
Partially sorted readout:
– Hits within one double-column (half-length) are scrambled.
– Double-columns are read out in order.
– FE chips are read out in order.
All details in [2008 JINST 3 P07007]
[Figure: module layout with readout order direction]

Cluster shapes
Most clusters are smaller than 5 x 3 pixels.
They safely fit in a matrix 8 pixels wide along η.
[Figure: cluster shape distributions for B layer and Layer 1]

How to save logic resources?
• Take advantage of partial readout ordering
• Use a sliding window to process the complete module
• Use a 328 x 8 matrix (sliding window)
– Full rφ length (can be squeezed)
– Larger than maximum cluster size (5)
– Far-away clusters are unrelated
– Clock freq. 58 MHz
• 3 cycles/hit ~ 20 MHz hit processing rate
– Area usage 32% (xc5vlx155)
– Use 2 matrices: 64% of xc5vlx155 safely processes one 40 MHz S-link
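The rate figures above follow from simple arithmetic, sketched here with a hypothetical helper `hit_rate_mhz`. It assumes every hit costs the quoted 3 clock cycles and that two matrices share the load evenly.

```python
def hit_rate_mhz(clock_mhz, cycles_per_hit, n_matrices=1):
    """Hit processing rate in MHz for n_matrices matrices working in parallel."""
    return n_matrices * clock_mhz / cycles_per_hit


one_matrix = hit_rate_mhz(58, 3)       # about 19.3 MHz, quoted as ~20 MHz
two_matrices = hit_rate_mhz(58, 3, 2)  # about 38.7 MHz
```

Under these assumptions two matrices give roughly 38.7 MHz, comparable to the 40 MHz benchmark hit rate of one S-link.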

Clustering by 328 x 8 slices?
Fill a 328 x 8 slice with module data, read out the 1st cluster, read out the 2nd cluster, and so on (eta direction -->).
Shift of hits comes for free (no extra time)! Just use the slice as a circular buffer in the eta direction. Then hits are shifted by redefining the first column.
SLIDING WINDOW: with one xc5vlx155, process one S-Link.
– Implement 2 processing matrices.
– Process hits at 40 MHz rate.
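The circular-buffer idea can be modeled in a few lines. This is a sketch, not the firmware: the class name `SlidingWindow`, its methods, and the choice of `col % width` as the physical column index are all assumptions made for illustration. Sliding the window only clears the columns that leave it and updates `first_col`; no hit data is moved.

```python
class SlidingWindow:
    """328 x 8 slice used as a circular buffer along the column (eta) direction."""

    def __init__(self, n_rows=328, width=8):
        self.n_rows = n_rows
        self.width = width
        self.first_col = 0  # module column mapped to the window's left edge
        self.cells = [[False] * width for _ in range(n_rows)]

    def _physical(self, col):
        # Map a module column to a buffer column, rejecting out-of-window access
        if not self.first_col <= col < self.first_col + self.width:
            raise IndexError("column outside the current window")
        return col % self.width

    def load(self, row, col):
        self.cells[row][self._physical(col)] = True

    def is_hit(self, row, col):
        return self.cells[row][self._physical(col)]

    def advance(self, n=1):
        """Slide by n columns (n <= width): clear leaving columns, move no data."""
        for col in range(self.first_col, self.first_col + n):
            physical = col % self.width
            for row in self.cells:
                row[physical] = False
        self.first_col += n
```

After `advance(2)`, module column 9 becomes addressable and reuses the physical storage that previously held column 1.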

Another clustering algorithm
Apply a mask to all pixels in parallel: make the cluster and find its center at once.
No need for long priority chains.
Caveats: the center is approximate; rarely splits clusters.
Clustering executed in parallel:
• ideal for level-1 applications
• ideal for on-detector applications
ALGORITHM RULES:
• turn off “external hits”
• for each hit the delete rule is: turn off “hit” IF
– one of the “red cells” is hit AND
– all “white cells” are off
• loop over this list of “rules/masks” until only single hits are left
List of rules/masks can be changed
[Figure: logic matrix; legend: center of cluster]

Conclusions
• Developed a clustering algorithm for the Fast Tracker
– Full resolution with linear processing time
• The algorithm is fully general
– Could do 3D clustering
– Re-usable for other applications
– Can employ a flexible “contiguity/cluster” definition
– For the Fast Tracker, contiguity is by side or corner
• Also proposing a fully parallel algorithm
– Level 1, on-detector applications

BACKUP SLIDES

Firmware diagram
ToT = time over threshold
[Diagram blocks: input (Row, Col, ToT); FIFO (Row, Col); 328 x 8 processing matrix (core logic); Row, Col by cluster; RAM storing ToT, ~21 kbits (328 x 8 x 8 bits); ToT average calculator; FSM & control logic; output: average (x, y) cluster centers]

FPGA cost (Feb 2009)
• Today Avnet quotes for 1 piece
– xc5vlx330 --> 9600$ up (used for implementation)
– xc5vlx220 --> 3900$ up
– xc5vlx155 --> 2300$ up
– xc5vlx110 --> 1500$ up
• Logic proportional to the last number
– e.g. 330 --> 330000 logic cells
• In order to process 1 S-Link at 40 MHz
– Need two grids (328 x 8)
– Equivalent to ~110000 logic cells
– Plus surrounding logic and safety margin
• Example: choose 1 xc5vlx155 per S-Link
– Area usage > 66% from the 2 grids
– Need 120 FPGAs --> 276 k$ at today's price
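The cost estimate above can be reproduced with a short calculation. The prices are the single-piece quotes listed on the slide, the logic-cell count uses the slide's rule of thumb (last number of the part name times 1000), and the helper names are hypothetical.

```python
# Single-piece quotes from the slide (USD, Feb 2009)
PRICE_USD = {
    "xc5vlx330": 9600,
    "xc5vlx220": 3900,
    "xc5vlx155": 2300,
    "xc5vlx110": 1500,
}


def logic_cells(device):
    """Slide's rule of thumb: trailing number of the part name times 1000."""
    return int(device.removeprefix("xc5vlx")) * 1000


def grid_usage(device, cells_needed=110_000):
    """Fraction of the device taken by the two 328 x 8 grids (~110000 cells)."""
    return cells_needed / logic_cells(device)


def total_cost_usd(device, n_fpgas):
    return PRICE_USD[device] * n_fpgas


usage = grid_usage("xc5vlx155")          # about 0.71, consistent with "> 66%"
cost = total_cost_usd("xc5vlx155", 120)  # 276000 USD, i.e. 276 k$
```

`str.removeprefix` requires Python 3.9 or later.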