ALICE Week, 17.11.99, Technical Board
TPC Intelligent Readout Architecture
Volker Lindenstruth, Universität Heidelberg

What's new?
- TPC occupancy is much higher than originally assumed
- New trigger detector: TRD
- For the first time, TPC selective readout becomes relevant
- New readout/L3 architecture
- No intermediate buses and buffer memories: use [...] and local memory instead
- New dead-time or throttling architecture

TRD/TPC Overall Timeline
[Figure: timeline from the event at 0 to 5, time in ms. Labels: TRD pretrigger; TEC drift; data sampling, linear fit; end of TEC drift; track segment processing; track matching; data shipping off detector; TRD trigger at L1; trigger at TPC (gate opens).]

TPC L3 trigger and processing
[Figure: trigger and data-flow diagram. Recoverable labels:]
- Front-end / trigger: ~2 kHz
- Trigger levels: TRD L0pre, L0, L1, L2; Global Trigger; TRD trigger; other trigger detectors
- TRD: ship e+/e- track seeds; select regions of interest
- TPC intelligent readout: L1 trigger and readout of the TPC; 144 links, ~60 MB/evt
- Ship zero-suppressed TPC data, sector parallel; conical zero-suppressed readout
- TPC L3: tracking of e+/e- candidates inside the TPC; verify e+/e- hypothesis; e+/e- tracks plus RoIs; reject event
- On-line data reduction (tracking, reconstruction, partial readout, data compression)
- Track segments and space points to DAQ

Architecture from TP
[Figure: trigger and DAQ architecture. Detectors (ITS, TPC, PID, PHOS, TRIG) -> DDL -> LDCs -> switch -> GDCs -> switch -> PDS. Trigger side: L0, L1 and L2 trigger, trigger data, BUSY, STL, EDM.]

                          Pb-Pb                           p-p
Event rate                10^4 Hz                         10^5 Hz
L1 trigger (1.5-2 µs)     10^3 Hz                         10^4 Hz
L2 trigger (10-100 µs)    50 Hz central + 1 kHz dimuon    550 Hz
Data rate (LDC stage)     2500 MB/s                       20 MB/s
Data rate (PDS stage)     1250 MB/s                       20 MB/s

Some Technology Trends

          Capacity          Speed (latency)
Logic:    2x in 3 years     2x in 3 years
DRAM:     4x in 3 years     2x in 15 years
Disk:     4x in 3 years     2x in 10 years

DRAM
Year      Size       Cycle time
1980      64 Kb      250 ns
1983      256 Kb     220 ns
1986      1 Mb       190 ns
1989      4 Mb       165 ns
1992      16 Mb      145 ns
1995      64 Mb      120 ns
...
Overall:  1000:1!    2:1!
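The DRAM figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only; the helper names compound_growth and speedup are made up here and do not come from the talk. Five 3-year steps at 4x each give the 1000:1 capacity ratio between 1980 and 1995, and 250 ns over 120 ns gives the roughly 2:1 cycle-time ratio.

```c
/* Compound growth: factor_per_step applied `steps` times. */
static unsigned long compound_growth(unsigned long factor_per_step, unsigned steps)
{
    unsigned long total = 1;
    for (unsigned s = 0; s < steps; s++)
        total *= factor_per_step;
    return total;
}

/* Ratio of two cycle times, e.g. 250 ns (1980) over 120 ns (1995). */
static double speedup(double old_ns, double new_ns)
{
    return old_ns / new_ns;
}
```

Calling compound_growth(4, 5) covers 1980 to 1995 in five 3-year steps and returns 1024, i.e. 64 Kb grows to 64 Mb; speedup(250.0, 120.0) is slightly above 2.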

Processor-DRAM Memory Gap
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000.]
- µProc: 60%/yr (2x / 1.5 yr), "Moore's Law"
- DRAM: 6%/yr (2x / 15 yrs)
- Processor-memory performance gap: grows 50% / year
Source: Dave Patterson, UC Berkeley
Volker Lindenstruth, November 1999
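The 50%/year figure follows from the two growth rates: the gap grows by the ratio of the yearly factors, 1.60 / 1.06, which is about 1.51. A minimal sketch of that arithmetic, with function names chosen here for illustration:

```c
/* Annual growth of the processor/DRAM performance ratio.
   Rates are fractions per year, e.g. 0.60 for 60%/yr. */
static double gap_growth_per_year(double cpu_rate, double dram_rate)
{
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0;
}

/* Performance ratio accumulated after `years` whole years of compounding,
   starting from a ratio of 1. */
static double gap_after_years(double cpu_rate, double dram_rate, int years)
{
    double ratio = 1.0;
    for (int y = 0; y < years; y++)
        ratio *= (1.0 + cpu_rate) / (1.0 + dram_rate);
    return ratio;
}
```

With the slide's numbers, gap_growth_per_year(0.60, 0.06) is close to 0.51, and compounding it over ten years widens the gap roughly 60-fold.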

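Testing the uniformity of memory
[Figure: an array of "size" elements is swept at addresses "stride" apart, once per iteration.]

    #include <stdio.h>
    #include <time.h>

    // SIZE_MIN, SIZE_MAX, ITERATIONS and get_seconds() are not defined on the
    // slide; the values and the clock()-based timer below are assumptions.
    #define SIZE_MIN   (1L << 10)   // smallest array, in elements
    #define SIZE_MAX   (1L << 22)   // largest array, in elements
    #define ITERATIONS 4            // initial repeat count of the main loop

    // volatile keeps the compiler from removing the accesses;
    // unsigned avoids signed overflow of the element counters.
    static volatile unsigned array[SIZE_MAX];

    static double get_seconds(void) { return (double)clock() / CLOCKS_PER_SEC; }

    int main(void)
    {
        long size, stride, limit, index, i, iter, iterations;
        double sec, sec0;

        // Vary the size of the array, to determine the size of the cache or the
        // amount of memory covered by TLB entries.
        for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
            // Vary the stride at which we access elements,
            // to determine the line size and the associativity.
            for (stride = 1; stride <= size; stride *= 2) {
                // Do the following test multiple times so that the granularity of the
                // timer is better and the start-up effects are reduced.
                sec = 0; iter = 0;
                limit = size - stride + 1;
                iterations = ITERATIONS;
                do {
                    sec0 = get_seconds();
                    for (i = iterations; i; i--)
                        // The main loop.
                        // Does a read and a write from various memory locations.
                        for (index = 0; index < limit; index += stride)
                            *(array + index) += 1;
                    sec += (get_seconds() - sec0);
                    iter += iterations;
                    iterations *= 2;
                } while (sec < 1);
                // The slide is cut off here; printing the time per access
                // (size / stride accesses per pass) is an assumed ending.
                printf("size %ld stride %ld: %.1f ns/access\n", size, stride,
                       sec * 1e9 / ((double)iter * (double)(size / stride)));
            }
        }
        return 0;
    }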

360 MHz Pentium MMX
[Figure: results of the memory test. Annotated values: 2.7 ns, 95 ns, 190 ns; 32 bytes, 4094 bytes.]
- L1 instruction cache: 16 KB
- L1 data cache: 16 KB (4-way associative, 16-byte line)
- L2 cache: 512 KB (unified)
- MMU: 32 I / 64 D TLB (4-way associative)

360 MHz Pentium MMX
[Figure: two panels of the same measurement, "L2 cache off" and "All caches off".]

Comparison of two supercomputers

HP V-Class (PA-8x00)
- L1 instruction cache: 512 KB
- L1 data cache: 1024 KB (4-way associative, 16-byte line)
- MMU: 160 fully associative TLB

SUN E10k (UltraSparc II)
- L1 instruction cache: 16 KB
- L1 data cache: 16 KB (write-through, non-allocate, direct mapped, 32-byte line)
- L2 cache: 512 KB (unified)
- MMU: 2x64 fully associative TLB

LogP
[Figure: P processors, each with memory (M) and a network interface card, attached to an interconnection network; o (overhead) and g (gap) are marked at the nodes, L (latency) at the network.]
- L: time a packet travels in the network from sender to receiver
- o: overhead to send or receive a message
- g: shortest time between sent or received messages
- P: number of processors
- Volume limited by L/g (aggregate throughput)
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPOPP, May 1993
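The four parameters combine into a few simple cost formulas. The sketch below uses the standard LogP relations (one message costs o + L + o end to end, at most ceil(L/g) messages per processor are in flight, and a sender can inject a new message every max(g, o)); the struct and function names are invented for this illustration, and g is assumed to be greater than zero.

```c
/* LogP parameters, all times in the same unit (e.g. microseconds). */
typedef struct {
    unsigned L;  /* network latency */
    unsigned o;  /* send or receive overhead */
    unsigned g;  /* gap between consecutive messages, assumed > 0 */
    unsigned P;  /* number of processors */
} logp_params;

/* End-to-end time of one message: send overhead + latency + receive overhead. */
static unsigned logp_one_way(const logp_params *m)
{
    return m->o + m->L + m->o;
}

/* Maximum number of messages in flight per processor: ceil(L / g). */
static unsigned logp_in_flight_limit(const logp_params *m)
{
    return (m->L + m->g - 1) / m->g;
}

/* Time until the last of n back-to-back messages from one sender has been
   received: the sender injects one message every max(g, o). */
static unsigned logp_stream_time(const logp_params *m, unsigned n)
{
    unsigned interval = (m->g > m->o) ? m->g : m->o;
    if (n == 0)
        return 0;
    return (n - 1) * interval + logp_one_way(m);
}
```

For example, with L = 6, o = 2 and g = 4, a single message takes 10 time units, two messages can be in flight per processor, and a stream of three messages completes after 18 units.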

2-Node Ethernet Cluster
[Figure: Gigabit Ethernet, Gigabit Ethernet with carrier extension, Fast Ethernet (100 Mb/s). Source: Intel]
Test:
- SUN Gigabit Ethernet card, IP 2.0
- 2 SUN 450 Ultra Server, 1 each
- The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away
- Processor utilization: sender 40%, receiver 60%!
- Throughput: ca. 160 Mbit/s!
- Net throughput increases if the receiver is implemented as a dual-processor machine
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.

First Conclusions - Outlook
- Memory bandwidth is the limiting and determining factor. Moving data requires significant memory bandwidth.
- Number of TPC data links dropped from 528 to 180
- Aggregate data rate per link: ~34 MB/s @ 100 Hz
- The TPC has the highest processing requirements; the majority of the TPC computation can be done on a per-sector basis.
- Keep the number of [...]s that process one sector in parallel to a minimum
- Today this number is 5, due to the TPC granularity
- Try to get sector data directly into one processor
- Selective readout of TPC sectors can reduce the data rate requirement by factors of at least 2-5
- Overall complexity of the L3 processor can be reduced by using [...]-based receiver modules that deliver the data straight into the host memory, thus eliminating the need for VME crates combining the data from multiple TPC links.
- DATE already uses a GSM paradigm as memory pool: no software changes
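The link numbers above can be turned into a total bandwidth and an event size. The following sketch only multiplies out the figures from the slide (180 links, ~34 MB/s per link, 100 Hz, reduction factors 2 to 5); the function names are hypothetical.

```c
/* Total rate over all links, in MB/s. */
static double aggregate_rate_mb_s(int links, double per_link_mb_s)
{
    return links * per_link_mb_s;
}

/* Average event size in MB for a given total rate and event rate. */
static double event_size_mb(double aggregate_mb_s, double event_rate_hz)
{
    return aggregate_mb_s / event_rate_hz;
}

/* Rate left after selective readout with the given reduction factor. */
static double reduced_rate_mb_s(double aggregate_mb_s, double reduction_factor)
{
    return aggregate_mb_s / reduction_factor;
}
```

With 180 links at 34 MB/s the total is about 6.1 GB/s, which at 100 Hz corresponds to about 61 MB per event. A selective readout factor of 2 to 5 would bring the total down to roughly 3.1 to 1.2 GB/s.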

Receiver Card Architecture
[Figure: block diagram. Blocks: optical receiver, data FIFO, multi-event buffer, FPGA, [...] 66/64 host bridge, host memory. Labels: push readout, pointers.]

Readout of one TPC sector
[Figure: receiver boards (RcvBd) in receiver processors, x2x18, between the cave and the counting house, connected to the L3 network.]
- Each TPC sector is read out by four optical links, which are fed by a small derandomizing buffer in the TPC front-end.
- The optical receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house.
- The COTS receiver processor performs any necessary hit-level functionality on the data in case of L3 processing.
- The receiver processor can also perform lossless compression and simply forward the data to DAQ, implementing the TP baseline functionality.
- The receiver processor is much less expensive than any crate-based solution.

Overall TPC Intelligent Readout Architecture
[Figure: overall architecture. Inner Tracking System, Muon Tracking Chambers, Particle Identification and Photon Spectrometer, each via DDL to an LDC/[...]; 36 TPC sectors, each to LDC/L3 nodes; switch; L3 matrix of GDC/L3 nodes; PDS; computer center. Trigger detectors (Micro Channelplate, Zero-Degree Cal., Muon Trigger Chambers, Transition Radiation Detector) feed the L0, L1 and L2 trigger; trigger decisions, detector busy, trigger data, EDM.]
- Each TPC sector forms an independent sector cluster.
- The sector clusters merge through a cluster interconnect/network into a global processing cluster.
- The aggregate throughput of this network can be scaled up to beyond 5 GB/s at any point in time, allowing a fallback to simple lossless binary readout.
- All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time.
- All processing elements can be replaced and upgraded at any point in time.
- The network is commercial.
- The resulting multiprocessor cluster is generic and can be used as an off-line farm.

Dead Time / Flow Control
[Figure: TPC buffer (8 black events); TPC receiver buffer (> 100 events) on the RcvBd, with high-water mark (send XOFF) and low-water mark (send XON); event receipt daisy chain.]

Scenario I
- TPC dead time is determined centrally.
- For every TPC trigger a counter is incremented.
- For every completely received event, the last receiver module produces a message (single-bit pulse), which is forwarded through all nodes after they have also received the event.
- The event receipt pulse decrements the counter.
- The counter reaching count 7 asserts TPC dead time (there could be another event already in the queue).

Scenario II
- TPC dead time is determined centrally, based on rates, assuming worst-case event sizes.
- Overflow protection for buffers: assert TPC BUSY if 7 events within 50 ms (assuming 120 MB/event, 1 Gbit)
- Overflow protection for receiver buffers: ~100 events in 1 second, OR high-water mark in any receiver buffer (preferred way)

- No need for reverse flow control on the optical link
- No need for dead time signalling at the TPC front-end
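The two dead-time scenarios can be sketched as small pieces of central bookkeeping. This is a minimal illustration under stated assumptions, not the implementation from the talk: all type and function names are hypothetical, the threshold of 7 and the 50 ms window are the values quoted on the slide, and timestamps are assumed to be microseconds from some monotonic clock.

```c
#include <stdbool.h>

#define DEADTIME_THRESHOLD 7       /* slide: count 7 asserts TPC dead time */
#define WINDOW_EVENTS      7       /* slide: 7 events ...                  */
#define WINDOW_US          50000u  /* ... within 50 ms                     */

/* Scenario I: count events that were triggered but not yet fully received. */
typedef struct { unsigned in_flight; } deadtime_counter;

static void on_trigger(deadtime_counter *c) { c->in_flight++; }

static void on_event_receipt(deadtime_counter *c)
{
    if (c->in_flight > 0)          /* ignore a spurious receipt pulse */
        c->in_flight--;
}

static bool tpc_dead(const deadtime_counter *c)
{
    return c->in_flight >= DEADTIME_THRESHOLD;
}

/* Scenario II: remember the times of the last WINDOW_EVENTS triggers. */
typedef struct {
    unsigned long long t[WINDOW_EVENTS]; /* ring buffer of trigger times */
    unsigned n;                          /* entries filled, at most WINDOW_EVENTS */
    unsigned head;                       /* next slot; the oldest entry once full */
} rate_window;

static void rate_record(rate_window *w, unsigned long long now_us)
{
    w->t[w->head] = now_us;
    w->head = (w->head + 1) % WINDOW_EVENTS;
    if (w->n < WINDOW_EVENTS)
        w->n++;
}

/* BUSY while the last WINDOW_EVENTS triggers all lie within WINDOW_US of now. */
static bool rate_busy(const rate_window *w, unsigned long long now_us)
{
    if (w->n < WINDOW_EVENTS)
        return false;
    return now_us - w->t[w->head] < WINDOW_US;
}
```

In both variants the decision is taken centrally from trigger and receipt information alone, which is why no reverse flow control on the optical link is needed. Scenario I tracks the real buffer occupancy, while Scenario II only bounds it through the assumed worst-case event size.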

Summary
- Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail.
- Do not move data if not required: moving data costs money (except for some granularity effects).
- Overall complexity can be reduced by using [...]-based receiver modules that deliver the data straight into the host memory, thus eliminating the need for VME.
- General-purpose COTS processors are less expensive than any crate solution.
- An FPGA-based receiver card prototype is built; the NT driver is completed, the Linux driver almost completed.
- DDL is already planned as a [...] version.
- No reverse flow control is required for the DDL.
- The DDL URD is to be revised by the collaboration ASAP.
- No dead time or throttling needs to be implemented at the front-end.
- There are two scenarios for implementing it for the TPC at the back-end without additional cost.