Using Boosted Decision Trees to Separate Signal and Background

Using Boosted Decision Trees to Separate Signal and Background in $B \to X_s \gamma$ Decays
James Barber
16 August 2006
University of Massachusetts, Amherst
jbarber@student.umass.edu

PEP-II/BaBar at SLAC

BaBar Detector

$e^+ e^- \to \Upsilon(4S) \to B\bar{B}$
Beam energies: 9 + 3.1 GeV = 12.1 GeV in the lab frame, 10.58 GeV in the center-of-mass frame
$\Upsilon(4S)$ resonance: 10.58 GeV
$2 \times m_B$ = 10.558 GeV
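The two energies quoted on this slide can be checked with a short calculation. The sketch below assumes a head-on collision of ultra-relativistic beams (electron mass neglected), for which the center-of-mass energy is 2·sqrt(E1·E2). The function names are illustrative, and the 8.990 GeV and 3.112 GeV values are the HER and LER energies printed on the event display later in the talk.

```python
import math

def lab_energy(e_her_gev, e_ler_gev):
    """Total energy in the lab frame: simply the sum of the two beam energies."""
    return e_her_gev + e_ler_gev

def cm_energy(e_her_gev, e_ler_gev):
    """Center-of-mass energy for a head-on collision of two
    ultra-relativistic beams (assumption: electron mass neglected)."""
    return 2.0 * math.sqrt(e_her_gev * e_ler_gev)

TWO_M_B_GEV = 10.558  # 2 x m_B, as quoted on the slide

# Rounded beam energies from the slide.
print(lab_energy(9.0, 3.1))
print(round(cm_energy(9.0, 3.1), 2))        # about 10.56

# Beam energies from the event display (HER, LER).
sqrt_s = cm_energy(8.990, 3.112)
print(round(sqrt_s, 2))                      # about 10.58
print(sqrt_s > TWO_M_B_GEV)                  # just above the B Bbar threshold
```

With the rounded beam energies the result comes out slightly low; the more precise HER and LER values reproduce the 10.58 GeV of the resonance.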

Decay of a B meson

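$b \to s\gamma$
Quark content: $B^0$: $\bar{b}d$, $B^+$: $\bar{b}u$, $\bar{B}^0$: $b\bar{d}$, $B^-$: $b\bar{u}$
The b quark can decay to an s quark via this process.
[Feynman diagram: radiative penguin decay]
Predicted and experimental branching fractions:
$\mathcal{B}_{\text{theo}}(B \to X_s \gamma) = (3.61 \pm 0.49) \times 10^{-4}$
$\mathcal{B}_{\text{exp}}(B \to X_s \gamma) = (3.55 \pm 0.26) \times 10^{-4}$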

How the detector can tell the difference
IT CAN'T!!
Separation must be done in software.
[Event display: The PEP-II/BaBar B-Factory. Run: 2405635. Date taken: Tue Dec 31 16:27:42.787195000 1996 PST. HER: 8.990 GeV, LER: 3.112 GeV.]

Expected amounts of Signal and Background
[Plot: photon energy spectrum from simulation; events / 20 MeV on a log scale from 1 to $10^6$, versus $E^*_\gamma$ from 1.6 to 3.2 GeV; curves: expected continuum, expected $B\bar{B}$, expected signal.]
For $10^2$ to $10^3$ signal events there are $10^5$ to $10^6$ background events!
3 or 4 orders of magnitude of background must be suppressed to get a signal.

Practice on Fake Data!
Monte Carlo (MC) simulated data, with life-like variables plus type and class variables.
Event variables: leptonmomentumcm, costhetagamma, econe1, econe2, ..., econe18, egammab, type, class

Separating with Variables
Pick a variable with high separation power, then divide the data in a nice way.
[Plot: "Nice, fake data"; signal and background histograms over a variable ranging from 0 to 10, with a cut dividing rejected from accepted events.]
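A single-variable cut of this kind can be sketched in a few lines. Everything here is hypothetical toy data, not the talk's MC sample: background is drawn around 3 and signal around 7 so that one threshold separates them cleanly, and `apply_cut` is an illustrative name.

```python
import random

random.seed(1)

# Hypothetical "nice, fake data": one variable on a 0-10 scale.
background = [random.gauss(3.0, 1.0) for _ in range(1000)]
signal = [random.gauss(7.0, 1.0) for _ in range(1000)]

def apply_cut(values, threshold):
    """Accept events at or above the threshold; everything below is rejected."""
    return [v for v in values if v >= threshold]

threshold = 5.0
signal_kept = len(apply_cut(signal, threshold)) / len(signal)
background_kept = len(apply_cut(background, threshold)) / len(background)

print(f"signal kept:     {signal_kept:.1%}")
print(f"background kept: {background_kept:.1%}")
```

When the two distributions overlap heavily, no choice of `threshold` keeps most of the signal while rejecting most of the background.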

Separating with Variables
Pick a variable with high separation power, then divide the data in a nice way.
No such variable!
[Plot: leptonmomentumcm from 1 to 2.6 GeV; signal and background histograms.]

Enter: Boosted Decision Trees (BDTs)
[Diagram labels: MC Data, MC Events, Root Node]
BDTs are an advanced (complicated) separation method that learns to distinguish between signal and background events.
BDTs need to be taught to separate signal and background: they are trained on MC events.
Train on an event sample representative of the data range.
All training events form the root node of a BDT.

How a Decision Tree works
[Diagram: example tree. The root node, S/B 52/48, splits on PMT hits (< 100 or ≥ 100) into B 4/37 and S/B 48/11. S/B 48/11 splits on energy (< 0.2 GeV or ≥ 0.2 GeV) into S/B 9/10 and S 39/1. S/B 9/10 splits on radius (< 500 cm or ≥ 500 cm) into S 7/1 and B 2/9.]
Start with some number of Monte Carlo events in a root node.
Pick a variable/value combination to separate the events.
If a node has more than a specified purity, or fewer than a specified number of events, stop separating.
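The three steps on this slide can be sketched as a small recursive tree builder. This is a minimal illustration, not the code used in the analysis. The slides do not say how the variable/value combination is chosen, so the Gini index is assumed as the split criterion, with cut values scanned on an evenly spaced grid. All names (`grow`, `best_split`, `max_purity`, `min_events`) and the toy events are hypothetical.

```python
import random

def signal_fraction(events):
    """Weighted fraction of signal events in a node (its signal purity)."""
    total = sum(weight for _, _, weight in events)
    signal = sum(weight for _, is_signal, weight in events if is_signal)
    return signal / total

def gini(events):
    """Weighted Gini index of a node: total weight times p * (1 - p)."""
    total = sum(weight for _, _, weight in events)
    p = signal_fraction(events)
    return total * p * (1.0 - p)

def best_split(events, n_cuts=20):
    """Scan evenly spaced cut values for every variable and return the
    split with the lowest summed Gini index (assumed criterion)."""
    best = None
    for var in events[0][0]:
        values = [features[var] for features, _, _ in events]
        low, high = min(values), max(values)
        for i in range(1, n_cuts):
            cut = low + (high - low) * i / n_cuts
            below = [e for e in events if e[0][var] < cut]
            above = [e for e in events if e[0][var] >= cut]
            if not below or not above:
                continue
            score = gini(below) + gini(above)
            if best is None or score < best[0]:
                best = (score, var, cut, below, above)
    return best

def grow(events, max_purity=0.95, min_events=50):
    """Split recursively; stop when a node is pure enough or too small."""
    p = signal_fraction(events)
    leaf = {"leaf": True, "is_signal": p >= 0.5}
    if len(events) < min_events or max(p, 1.0 - p) > max_purity:
        return leaf
    split = best_split(events)
    if split is None:
        return leaf
    _, var, cut, below, above = split
    return {"leaf": False, "var": var, "cut": cut,
            "below": grow(below, max_purity, min_events),
            "above": grow(above, max_purity, min_events)}

def classify(tree, features):
    """Follow the cuts from the root down to a leaf."""
    while not tree["leaf"]:
        side = "below" if features[tree["var"]] < tree["cut"] else "above"
        tree = tree[side]
    return tree["is_signal"]

# Toy events (hypothetical): signal only where both x and y are positive,
# so no single variable separates the two classes on its own.
random.seed(0)
events = []
for _ in range(1000):
    features = {"x": random.uniform(-1, 1), "y": random.uniform(-1, 1)}
    is_signal = features["x"] > 0 and features["y"] > 0
    events.append((features, is_signal, 1.0))

tree = grow(events)
print(classify(tree, {"x": 0.5, "y": 0.5}))
print(classify(tree, {"x": -0.5, "y": 0.5}))
```

Each event is a `(features, is_signal, weight)` tuple, so weights changed by boosting feed directly into the purity and split calculations.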

How a BDT works
[Diagram: the same example decision tree as on the previous slide.]
Misclassified events are given boosted weights.
A new root node is made, and a new tree is generated.
500 or 1000 trees are made in this way, with boosting after each classification.
A BDT learns from its mistakes!
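The slide does not say which boosting rule was used. As an illustration only, the sketch below assumes an AdaBoost-style update: compute the weighted error rate err of the current tree, set alpha = beta · ln((1 − err) / err), multiply the weight of each misclassified event by exp(alpha), and renormalize. The function name and the `beta` parameter are hypothetical.

```python
import math

def boost_weights(weights, misclassified, beta=0.5):
    """Return renormalized event weights with the misclassified events
    boosted (AdaBoost-style update, assumed here for illustration)."""
    total = sum(weights)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    if err == 0.0 or err == 1.0:
        # Degenerate tree: alpha is undefined, so only renormalize.
        return [w / total for w in weights]
    alpha = beta * math.log((1.0 - err) / err)
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, misclassified)]
    new_total = sum(boosted)
    return [w / new_total for w in boosted]

# Four equally weighted events; the current tree got the first one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [True, False, False, False]
new_weights = boost_weights(weights, misclassified)
print(new_weights)
```

The next tree is then grown from a new root node using `new_weights`, so the events the previous tree got wrong count for more when splits are chosen.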

Testing a forest of BDTs
[Diagram labels: MC Data, Testing Data, MC Events, Testing Events]
All events not used for training are used for testing.
These events are run through every tree created in training.
Every time an event is classified as signal, its $N_{\text{signal}}$ is incremented.
Calculate a likelihood, $l = N_{\text{signal}} / N_{\text{trees}}$.
Background tends towards 0, signal tends towards 1.
Plot the likelihood to see the separation.
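The likelihood is a vote count over the forest. In the sketch below each tree is represented by a function that returns True when it classifies an event as signal. The four threshold rules are hypothetical stand-ins for trained trees.

```python
def likelihood(event, trees):
    """Fraction of trees that classify the event as signal."""
    n_signal = sum(1 for tree in trees if tree(event))
    return n_signal / len(trees)

# Hypothetical stand-ins for trained trees: simple threshold rules on "x".
trees = [lambda event, cut=cut: event["x"] > cut for cut in (0.2, 0.4, 0.6, 0.8)]

print(likelihood({"x": 0.7}, trees))  # 3 of 4 trees vote signal: 0.75
print(likelihood({"x": 0.1}, trees))  # no tree votes signal: 0.0
```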

Likelihood Values
[Plot: Signal and Background Separation; likelihood from 0 to 1; curves: Signal, BBbar, Continuum.]

Determining separation quality
Calculate a Figure of Merit, Q:
$Q = \frac{S}{\sqrt{S + B' + C'}}$, with $B' = B\,(1 + f\,B_{\text{error}})$ and $C' = \frac{1}{f}\,C$
S: sum of signal events selected
B: sum of $B\bar{B}$ background events selected
C: sum of continuum events selected
f: on-peak percentage (0.9)
$B_{\text{error}}$: accounts for uncertainty in the exact amount of $B\bar{B}$ background in data
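As a worked example, the figure of merit can be computed directly. The sketch reads the formula on this slide as Q = S / sqrt(S + B' + C'), with B' = B (1 + f · B_error) and C' = C / f. That reading is an assumption, since the slide's layout leaves some room for interpretation. The event counts below are made up for illustration.

```python
import math

def figure_of_merit(s, b, c, f, b_error):
    """Q = S / sqrt(S + B' + C'), under the assumed reading
    B' = B * (1 + f * b_error) and C' = C / f."""
    b_prime = b * (1.0 + f * b_error)
    c_prime = c / f
    return s / math.sqrt(s + b_prime + c_prime)

# Hypothetical counts: 100 signal, 50 BBbar and 45 continuum events selected,
# with f = 0.9 and no extra BBbar uncertainty.
q = figure_of_merit(s=100, b=50, c=45, f=0.9, b_error=0.0)
print(round(q, 2))  # 100 / sqrt(200), about 7.07
```

Because C is divided by f and B is inflated by the uncertainty term, leftover background lowers Q by more than its raw count alone would.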

Determining separation quality
Calculate the efficiency of the separation: take the total events kept by the selection algorithm and divide by the total starting events.
[Plot: efficiencies (0 to 1) versus EGammaStar (GeV) from 2 to 2.8 GeV; curves: signal, BBbar background, continuum.]

What I did
Get this all working!
Find the parameter configuration which gives the maximum Figure of Merit, Q.
Some parameters tested + results (749,683 total events):

Training events   Trees   MinEvents/node   Cuts   Q
350,000           1,000   50               50     13.99
350,000           500     50               50     13.81
100,000           1,000   50               50     18.21
100,000           1,000   100              50     18.18

And the winner is...
Training events: 100,000   Trees: 1,000   MinEvents/node: 50   Cuts: 100   Q = 18.37
[Plot: Signal and Background Separation; likelihood from 0 to 1; curves: Signal, BBbar, Continuum.]

Conclusion
Better signal and background separation reduces uncertainties.
This is important for making precision measurements.
It increases the sensitivity to new physics.
It is possible to have a previously unseen heavy particle in the penguin loop: H?