

Track Parallelisation in GEANT Detector Simulations*

J. Maillard, J. Silva
Laboratoire de Physique Corpusculaire, College de France, Paris, France

Track parallelisation of GEANT-based detector simulations, worked out for a parallel computer, is described. Its implementation on a T9000-based TN310 computer (16 processors) is explained in detail. With the help of the shower tracking example Gexam1, we show that the speedup obtained is satisfactory for high energies and a sufficient number of initial particles. Load balancing policies are discussed.

* This work is supported by the European Union, GPMIMD contract No. 1P5404.
Preprint submitted to Elsevier Preprint, 4 August 1995

1 Introduction

Farm parallelisation provides the best speedup for GEANT-based programs running on MPPs. Event or job level parallelisation minimizes the communication work between different processors [1]. However, the minimum time for treating an event is the time spent by a single processor in sequential mode. This time is roughly proportional to the incident track energy and to the number of initial tracks. For very high energy applications (LHC, astrophysics), the time of a single full event simulation will be very high. It has long been noted that different tracks are independent of each other, and that a parallelisation at the track level could decrease this single-event execution time [2].

Though convenient (especially in the debugging phase), this gain of time is not critical for simulation tasks where there are no severe delay constraints. However, there are a number of reasons for undertaking this parallelisation. In many parallel computers actually in use, the lack of virtual memory forbids the implementation of very complex programs: they are restricted by the limited amount of real memory in each processor. Even if virtual memory exists in a parallel machine, the communication activity with the disk server would grow very fast and the speedup would quickly decrease. Track level parallelisation permits a reduction of the complexity of the single processor task, at the price of a reduced global speedup. A geometrical parallelisation is the first "long time around" example. We can also imagine a track parallelisation where different processors take care of tracks with special characteristics. The example of shower track simulation with faster algorithms depending on track parameters has already been discussed [3]. It is therefore important to study how the speedup depends on different parallelisation schemes and load balancing policies.

2 Parallel implementation

In our parallelisation scheme, we divide the problem into one "master" task and N slave tasks. The master task does the basic input/output work and synchronizes event start-up and termination. It also acts as a dispatcher for tracks sent by the slaves and, finally, it collects all partial results coming from the slaves.

Fig. 1. Parallelisation Layout (MASTER, HOST, and SLAVE 1 to SLAVE N, connected by virtual links)

The master task can also act as an event generator, but this may instead be done by the slaves themselves or by a special third kind of task. The slave tasks perform the tracking. To guarantee the independence of different tracks, the random numbers used by each slave are generated by a different GRNDM random generator series. This implies, however, that an event cannot be reproduced exactly. Reproducibility can be implemented by including the random seed and the series number in the track information.

For a given particle, the slave can decide either to track it as a normal sequential program would, or to send the track to the master on the basis of some working criteria. There are two important limiting factors to speedup. The production of new particles is done within the tracking of one of them, so the dispatcher can provide work to a free processor only if this production is sufficiently fast. On the other hand, if too many particles are exchanged at a given moment, the communication work increases. In a number of cases this can reduce the tracking task's access to the processor, introduce synchronization delays, and therefore decrease speedup. A good speedup is a compromise between these two situations.

Fig. 2. Program block diagram (master and slave sequences; boxes include UGINIT, event generator, GRUN_MASTER, GUWAIT, GUKINE, start of event, dispatching, end of event, collect results, GUTREV, receive particle, GUSTEP, TSKING, send particle, GUOUT, UGLAST)

From a GEANT programmer's point of view, it is important to keep the program structure as close as possible to the sequential one [4]. In our implementation, this is done by preserving most of the sequential program structure and changing a very limited number of subroutines. The master can perform the initialization (data card reading, initialization of geometry and detector description) and send it to the slaves. This work can, of course, also be done by each slave. After that, the master synchronizes the event start and keeps track of the state of all slaves. It receives particles and dispatches them to the first free worker. When all workers are free, it signals the end of the event, collects all results from the slaves and finally executes GUOUT as in the sequential version. This routine normally handles output for the current event. After the requested number of events, the master executes UGLAST and all required global input/output. If this final phase is so complex that it becomes a bottleneck, it can also be parallelised.

The slave program is very close to the sequential one. It performs the usual initialization, executes GUKINE, where synchronization and, possibly, event generation are done, and then GTREVE, where the event tracking is performed. This subroutine is changed in the parallel version in order to introduce the possibility of receiving tracks from the dispatcher. The worker tracks all the particles in its stack and, when no particles are left, sends a message to the dispatcher. The dispatcher then sends it a particle if any is in stock or, at the end of the event, the signal to finish the tracking phase. After this, the slave calls GUOUT, where all partial results are saved or sent to the master. If this last phase is very complex or time consuming, other tasks can be created in order to distribute this final load. In our case, histogramming and file handling are very simple, so we leave them to the master itself.

The decision of whether a particle should be tracked locally or sent to the master is made in the user-written routine GUSTEP. The user is free to choose a sending policy. A routine TSKING, analogous to GSKING, handles the track transmission to the dispatcher. A block diagram of both master and slave sequences is shown in figure 2.

This centralized dispatching permits a very simple task control, but it can decrease the speedup if the number of exchanged particles is very large, as is expected when the number of processors increases.
For some tens of processors, this is not the most important factor of efficiency decrease. However, for very large scale implementations one must think about distributed dispatching.

Each slave has to store in memory an amount of information that strongly depends on the application. This information consists mainly of geometry and material descriptions and calculated results (energy depositions, hits, etc.). When it is too large to fit in memory, it can be shared between slaves. The dispatching criteria in the master and the sending criteria in the slave can easily take this fact into account. For example, a slave can contain only a part of the detector description, or it can calculate results for only one type of particle. Particles that do not match a slave's conditions are sent to the master, which redispatches them to the appropriate slaves. This implies, of course, a reduction of speedup caused by transmission delays and load balancing, but it makes the implementation possible. A careful analysis must be performed for each particular case.
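The master/slave dispatching described in this section can be illustrated with a small toy model. The following Python sketch is not the GEANT implementation: the function name `simulate_event`, the round-based timing (each busy slave tracks one particle per round), and the toy physics (a particle of integer energy splits into two secondaries until the energy reaches 1) are all assumptions made only for illustration. It keeps the elements named in the text: a stock held by the dispatcher, a local stack per slave, a "send 1 out of m secondaries" policy, and the end of the event when all slaves are free.

```python
from collections import deque

def simulate_event(initial_energies, n_slaves, m):
    """Toy round-based model of centralized track dispatching (illustrative only)."""
    stock = deque(initial_energies)          # particles held by the master (dispatcher)
    stacks = [[] for _ in range(n_slaves)]   # local particle stack of each slave
    deposited = [0] * n_slaves               # partial results kept by each slave
    produced = [0] * n_slaves                # secondaries produced so far, per slave
    sent = 0                                 # particles sent back to the master
    rounds = 0
    while stock or any(stacks):
        # Dispatching: every free slave receives a particle if one is in stock.
        for stack in stacks:
            if not stack and stock:
                stack.append(stock.popleft())
        # Tracking: each busy slave tracks one particle per round.
        for i, stack in enumerate(stacks):
            if not stack:
                continue
            energy = stack.pop()
            if energy <= 1:
                deposited[i] += energy       # particle absorbed locally
                continue
            for secondary in (energy // 2, energy - energy // 2):
                produced[i] += 1
                if produced[i] % m == 0:     # sending policy: 1 out of m secondaries
                    stock.append(secondary)
                    sent += 1
                else:
                    stack.append(secondary)
        rounds += 1
    # End of event: all slaves are free and the master collects the partial results.
    return sum(deposited), rounds, sent

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        total, rounds, sent = simulate_event([64], n, m=2)
        print(n, total, rounds, sent)
```

In this toy model the number of rounds plays the role of the execution time, so comparing the one-slave and n-slave round counts gives a rough feeling for how the sending policy and the number of initial particles influence the speedup. Communication costs are ignored here, which the real system cannot do.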

3 The TN310 computer

We used the TN310 computer [6] for our tests. It is a T9000 transputer-based parallel computer, developed in the GPMIMD European Esprit program. The T9000 processor is the latest member of the transputer family. Apart from increased computing power, it distinguishes itself from the T800 family by some new features. In particular, it has a built-in communication processor (the VCP) that relieves the main processor of communication tasks.

Fig. 3. The TN310 MotherBoard (T9000 processors PR 1 to PR 16, C104 switches, control links, a link to the host, and N, E, W, S links to other motherboards)

Physically, the T9000 has 4 serial bidirectional links (as is the case for the T800), but new hardware permits the effective use of virtual channels. Communication is possible between any two processors in the same network, by hopping through other processors in the network without using their computing time. A companion chip, the C104, is a crossbar switch that can connect 32 physical channels. Details of these chips are given in [5]. The TN310 architecture is shown in figure 3. The computer can hold two motherboards, that is to say up to 32 processors, but larger configurations are also available.

The programming environment that we used is the Inmos Toolset. All programs were translated to C by means of the AT&T Bell Labs Fortran-to-C converter (f2c). This C version was then compiled, linked, and configured with the Inmos Toolset, and finally run as a standalone program on the computer. The whole GEANT library and CERNLIB were compiled and tested using these tools. All input/output is done through a server program on the Unix host. In our case, the host is a Sparc20 Sun workstation.

4 Results

Here we present the results obtained with the Gexam1 example, with N initial photons. We used this well-known example in order to have an easy check at every stage of the program development.

Fig. 4. Speedup for N=1 initial particle

In the slave GUSTEP routine, we chose to send to the master 1 out of m secondary particles produced. The parameter m is read from the data cards and takes the values 1, 2 or 3. The relation between the speedup and the portion of secondary particles transmitted to the dispatcher is shown in the figures. The speedup for n processors is defined by the relation:

$S_n = T_1 / T_n$

Here $T_1$ is the execution time of the one-slave network, which is very close to the sequential one-processor time, and $T_n$ is the execution time with n processors. We see that the speedup is limited in the 1 GeV case by the lack of particles. The situation for 10 GeV is better, especially when we send only half of the secondary particles produced. When the initial event is formed by many particles, the speedup is clearly increased, as shown in figure 5. Here we plotted the relation between the speedup and the number of initial particles transmitted by the dispatcher, with a fixed value of m = 2.
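As a small worked example of the speedup definition, the following Python sketch computes the speedup from two execution times. The parallel efficiency (speedup divided by the number of processors) is a standard derived quantity added here for illustration; it is not defined in the text. The function names and the timings in the usage lines are hypothetical and are not measurements from this work.

```python
def speedup(t_1, t_n):
    """Speedup S_n = T_1 / T_n, with T_1 the one-slave time and T_n the n-processor time."""
    return t_1 / t_n

def efficiency(t_1, t_n, n):
    """Parallel efficiency S_n / n (assumed here as a convenience, not taken from the text)."""
    return speedup(t_1, t_n) / n

if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
    t_1, t_16 = 120.0, 15.0
    print(speedup(t_1, t_16))         # 8.0
    print(efficiency(t_1, t_16, 16))  # 0.5
```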

Fig. 5. Speedup vs. number of initial particles (m=2)

Other strategies can give better results. The principal goal is to furnish enough particles to the slaves and, at the same time, to minimize communications. The dispatcher can provide feedback to the slaves in order to influence the number of particles sent to it. In this case it can build up a stock of particles and instruct the slaves to reduce the frequency of particles transmitted. If the stock decreases, it can increase this frequency. In our case this is done by setting the parameter m after each particle received by the dispatcher. This parameter is set by the dispatcher in each slave to the value

$m = \mathrm{int}\left( \frac{\max(n_{stock} - N_1,\, 0)}{N_2 \, N_{processors}} \right)$

where $N_1$ and $N_2$ are user-fixed parameters. $N_1$ is used to preserve a minimum stock and $N_2$ sets the feedback level. Figure 6 shows results for the simplest case, $N_1 = 0$ and $N_2 = 1$, with different numbers of initial photons. We see that the speedup is better than in the fixed-frequency case, as expected. We must keep in mind, however, that the criteria used for sending a particle back to the dispatcher might depend on the track characteristics, so the real situation can be worse.
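The feedback rule can be sketched in a few lines of Python. This is a minimal sketch under two stated assumptions: the rule is read as the stock excess over N_1, divided by N_2 times the number of processors and truncated to an integer, and a result of m = 0 is treated as "send every secondary" (m = 1), which the text does not specify. The function names `feedback_m` and `sending_period` are hypothetical.

```python
def feedback_m(n_stock, n_processors, n_1=0, n_2=1):
    """Feedback parameter m computed by the dispatcher from its current stock.

    Assumed reading of the rule: int(max(n_stock - N_1, 0) / (N_2 * N_processors)).
    """
    return max(n_stock - n_1, 0) // (n_2 * n_processors)

def sending_period(n_stock, n_processors, n_1=0, n_2=1):
    """Period actually used by a slave: 1 out of this many secondaries is sent.

    Mapping m = 0 to a period of 1 (send everything) is an assumption of this sketch.
    """
    return max(feedback_m(n_stock, n_processors, n_1, n_2), 1)

if __name__ == "__main__":
    # Simplest case from the text: N_1 = 0, N_2 = 1, on 16 processors.
    for n_stock in (0, 10, 16, 40, 100):
        print(n_stock, feedback_m(n_stock, 16), sending_period(n_stock, 16))
```

With this reading, a large stock gives a large m, so the slaves send particles less often, and an empty or small stock drives m down so that more secondaries reach the dispatcher.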

Fig. 6. Speedup for feedback loading

5 Conclusions

We showed that the speedup obtained in track parallelisation can be good for a medium number of processors. The results improve with increasing energy and complexity of the initial event. Although not quite competitive with event level parallelisation, track parallelisation can be unavoidable for very high energies or very complex simulations. In order to obtain a high speedup, however, one needs a very efficient and transparent communication network. The T9000 is a good example of a processor that implements this feature. With the tools that we worked out, track parallelisation can be done very easily for any standard GEANT-based simulation.

References

[1] L. Duflot, A. Jejcic, J. Maillard, J. Silva, G. Maurel, Simulation of LHC calorimeters on the T Node parallel computer. "Workshop on detectors and event simulation in High Energy Physics", Amsterdam, April
[2] L. Duflot, A. Jejcic, J. Maillard, J. Silva, G. Maurel, Operating HEP simulation codes on the T Node parallel computer. Computing in High Energy Physics Conference, Tsukuba, March. Preprint LPC
[3] L.M. Bertolotto, et al., "Feasibility studies for a high energy physics MC Program on Massive Parallel Platforms", C.H.E.P. 1994, San Francisco, April
[4] R. Brun et al., "GEANT Detector Description and Simulation Tool", CERN Program Library Long Writeup W5013.
[5] The T9000 Transputer, Hardware Reference Manual, SGS/Thomson Microelectronics,
[6] The TN310 computer, Telmat Multinode Training, R. Pathenay, June
