Machine Learning — The Future of LHC Theory (transcript, 2020-06-26)
ML Future
Tilman Plehn
Big LHC data
Classification
Error bars?
Generation
Inversion?
Machine Learning — The Future of LHC Theory
Tilman Plehn
Universitat Heidelberg
MCNet 6/2020
Why LHC?
Data from ATLAS & CMS
– HL-LHC = 2000 × Tevatron
– jet production: σ(pp → jj) × L ≈ 10^8 fb × 1000 fb⁻¹ ≈ 10^11 events
⇒ It’s proper big data
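The event count above is just σ × L; as a back-of-envelope sketch (numbers taken from the slide, not a precision statement):

```python
# Rough HL-LHC dijet count: N = sigma * L, with the slide's round numbers.
sigma_jj_fb = 1e8          # dijet production cross section ~ 10^8 fb
luminosity_invfb = 1000.0  # HL-LHC integrated luminosity ~ 1000 fb^-1

n_events = sigma_jj_fb * luminosity_invfb   # ~ 10^11 dijet events
```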
Physics with jets
– re-summed perturbative QFT prediction from QCD
– jets as decay products: 67% W → jj, 70% Z → jj, 60% H → jj, 67% t → jjj, 60% τ → j ...
– new physics in ‘dark showers’
⇒ It’s interesting
Monte Carlo data
– generators: Sherpa, Herwig, Pythia, MadGraph
– based on a QFT Lagrangian
– data-to-data comparison: MC vs LHC
⇒ It’s properly understood
Why not LHC?
ATLAS & CMS
– 3000 know-it-alls per experiment
– many just interested in detector
– top-down organized analysis groups
⇒ Shockingly little innovation
Expertise
– LHC data format: ROOT
– multi-variate analysis tool: TMVA
– TensorFlow from TMVA/ROOT
⇒ Limited sense of ML urgency
Experiment vs theory
– theorists seen as lacking team compatibility
– simulated data as good as actual data
– excellent personal ex-th connections
⇒ Someone has to drive developments...
1– Jet classification: Nothing is ever new
LHC visionaries
– 1991: NN-based quark-gluon tagger [Lönnblad, Peterson, Rögnvaldsson]
– 1994: jet algorithm for W , top... [Seymour]
And it is also done
Experiments driving, for once... [Ben’s talk]
– 2014/15: first jet image papers
– 2017: first (working) ML top tagger
– ML4Jets 2017: What architecture works best?
– ML4Jets 2018: Lots of architectures work [1902.09914]
⇒ Jet classification understood and done
What is new and cool and fun?
What about error bars?
Jet-by-jet uncertainties [Walter: Bayesians have more fun]
– (60 ± ??)% top?
– probability for test event p(c*|C) [classifier output C, network weights ω]

  p(c*|C) = ∫ dω p(c*|ω, C) p(ω|C) ≈ ∫ dω p(c*|ω, C) q(ω)

– loss: minimize KL-divergence + Bayes’ theorem

  KL[q(ω), p(ω|C)] = ∫ dω q(ω) log [ q(ω) / p(ω|C) ]
                   = ∫ dω q(ω) log [ q(ω) p(C) / ( p(C|ω) p(ω) ) ]
                   = KL[q(ω), p(ω)]            ← L2-regularization
                     + log p(C) ∫ dω q(ω)      ← normalization of q, irrelevant
                     − ∫ dω q(ω) log p(C|ω)    ← likelihood, maximized

⇒ L = KL[q(ω), p(ω)] − ∫ dω q(ω) log p(C|ω)
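A minimal numpy sketch of this loss, assuming a factorized Gaussian q(ω) and a standard-normal prior, so the KL term is closed-form and the likelihood term is a Monte Carlo estimate (the toy likelihood and all names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational posterior q(omega) = N(mu, sigma^2), prior p(omega) = N(0, 1).
mu = rng.normal(size=5)
log_sigma = rng.normal(scale=0.1, size=5)
sigma = np.exp(log_sigma)

# Closed-form KL[q, p] for Gaussians -- this is the L2-regularization term.
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)

def log_likelihood(omega):
    # Stand-in for log p(C|omega): a toy function, not a physics likelihood.
    return -0.5 * np.sum(omega**2)

# MC estimate of the expected log-likelihood, int dw q(w) log p(C|w),
# using the reparameterization trick omega = mu + sigma * z.
samples = mu + sigma * rng.normal(size=(1000, 5))
expected_ll = np.mean([log_likelihood(w) for w in samples])

loss = kl - expected_ll   # L = KL[q, p] - int dw q(w) log p(C|w)
```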
⇒ sample ω to extract (µpred, σpred)
  check prior independence
  check frequentist many-networks
[Figure: BNN sampling schematic — an ensemble of networks drawn from q(ω), each giving a classifier output (89%, 80%, 70%, ...), combined into µpred and σpred]
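The ω-sampling above can be sketched in a few lines, assuming a Gaussian q(ω) and a toy one-layer classifier (all shapes and numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "network": the classifier output depends on the sampled weights omega.
def classifier(omega, x):
    return 1.0 / (1.0 + np.exp(-(omega @ x)))   # sigmoid output in (0, 1)

x = np.array([0.5, -1.0, 0.3])                  # one test jet, toy features

# Sample omega from the (assumed Gaussian) variational posterior q(omega).
mu, sigma = np.array([0.4, -0.2, 0.8]), 0.3
omegas = mu + sigma * rng.normal(size=(2000, 3))

# Ensemble of outputs -> jet-by-jet prediction with an error bar.
outputs = np.array([classifier(w, x) for w in omegas])
mu_pred, sigma_pred = outputs.mean(), outputs.std()
```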
Statistics
Training statistics [Bollweg, Haussmann, Kasieczka, Luchmann, TP, Thompson]
– Bayesian version of DeepTop and LoLa
– similar performance as deterministic network; training time somewhat increased
– correlation between µpred and σpred [toy network, 10k jets]
– increasing training statistics [parabola from closed interval output]
Statistics and systematics
Regression: measure pT,t [Kasieczka, Luchmann, Otterpohl, TP]
– effect of noisy and size-limited data separated
  σpred: limited training sample
  σstoch: statistical behavior of training data [Gaussian likelihood]

  log p(C|ω) → log p(C|µ, σstoch) = (C − µ)² / (2σ²stoch) + ½ log σ²stoch + const

  σ²tot = σ²pred + σ²stoch   [all Gaussian]

– BNN sampling over an ensemble of networks, each network ωi returning (⟨pT⟩ωi, σstoch,ωi):

  ⟨pT⟩ = (1/N) Σi ⟨pT⟩ωi
  σ²pred = (1/N) Σi (⟨pT⟩ − ⟨pT⟩ωi)²
  σ²stoch = (1/N) Σi σ²stoch,ωi
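The σ decomposition on this slide is a few sums over the sampled networks; a minimal numpy sketch with toy GeV numbers (not the analysis code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Each sampled network omega_i returns a mean pT and a stochastic width
# (toy numbers in GeV, standing in for the BNN ensemble outputs).
pt_means = 600.0 + 5.0 * rng.normal(size=100)      # <pT>_{omega_i}
stoch_widths = 40.0 + 2.0 * rng.normal(size=100)   # sigma_stoch,omega_i

pt_mean = pt_means.mean()                                 # <pT> = (1/N) sum_i <pT>_{omega_i}
sigma_pred = np.sqrt(np.mean((pt_mean - pt_means) ** 2))  # spread of the ensemble
sigma_stoch = np.sqrt(np.mean(stoch_widths ** 2))         # mean stochastic width
sigma_tot = np.sqrt(sigma_pred**2 + sigma_stoch**2)       # all Gaussian
```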
– sample size dependence [statistics saturating]
[Figure: uncertainty in GeV vs training size (10^4–10^6 jets) for pT,j = 600–620 GeV, comparing σtot, σstoch, σpred and the MSE baseline]
– dependence on ISR and top-ness
⇒ Reasonable error estimate
[Figure: uncertainty in GeV vs pT,j = 500–800 GeV, with ISR, without ISR, and without ISR for the 25% most top-like jets, compared to the MSE σtot]
Jet calibration
Calibration means error propagation
– training on smeared data?? better: training with smeared labels [pT measured elsewhere, with error]
– Gaussian noise over pT,t label [e.g. 4%]
– distribution of extracted pT,t
  correlation extending to error bars
  slice with expected non-Gaussian tail from QCD radiation
[Figures: pT,t vs pT,j correlation (500–900 GeV) with truth, predicted, and max ± 68% band; normalized pT,t distribution for pT,j = 600–620 GeV, truth vs predicted with 68% band]
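The label smearing above is just multiplicative Gaussian noise on the pT,t labels; a sketch with the slide's illustrative 4% width:

```python
import numpy as np

rng = np.random.default_rng(3)

# True pT labels (GeV) for a toy training sample.
pt_labels = rng.uniform(500.0, 800.0, size=10000)

# Gaussian noise over the pT label, here 4% relative, mimicking a label
# measured elsewhere with a known error (width is illustrative).
smear = 0.04
pt_smeared = pt_labels * (1.0 + smear * rng.normal(size=pt_labels.size))

# The injected relative spread should come back out at ~4%.
relative_shift = np.std(pt_smeared / pt_labels - 1.0)
```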
– trace label smearing to network output, making sense of σnoise
⇒ Works!
[Figure: extracted uncertainty in GeV vs label smearing (0–120 GeV) for pT,j = 600–620 GeV: smearing input vs BNN calibration]
2– Learning from art
GANGogh [Bonafilia, Jones, Danyluk (2017)]
– old news: NNs turning pictures into art of a certain epoch; but can they create new pieces of art?
– train on 80,000 pictures [organized by style and genre]
– map noise vector to images
– generate flowers
– generate portraits
Edmond de Belamy [Caselles-Dupré, Fautrel, Vernier]
– trained on 15,000 portraits
– sold for $432,500
⇒ all about marketing and sales
GANGogh for jet images [de Oliveira, Paganini, Nachman]
– start with calorimeter images or jet images; sparsity the technical challenge
1- reproduce valid jet images from training data
2- organize them by QCD vs W-decay jets
– high-level observables to check
– not sold for cash
⇒ all about understanding
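The sparsity challenge mentioned above is easy to see on a toy calorimeter image: almost all pixels are empty, which is hard for image-based generative networks (image size and hit count here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 40x40 calorimeter image: a few hard constituents, everything else empty.
image = np.zeros((40, 40))
hits = rng.integers(0, 40, size=(30, 2))              # ~30 active cells
image[hits[:, 0], hits[:, 1]] = rng.exponential(10.0, size=30)

sparsity = np.mean(image == 0.0)   # fraction of empty pixels, close to 1
```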
GANs at LHC
Phase space networks
– MC integration [Bendavid (2017)]
– NNVegas [Klimek (2018), not really a generative network]
Existing GAN studies [Anja’s talk]
– Jet images [de Oliveira (2017), Carrazza (2019)]
– Detector simulations [Paganini (2017), Musella (2018), Erdmann (2018), Ghosh (2018), Buhmann (2020)]
– Event generation [Otten (2019), Hashemi (2019), Di Sipio (2019), Butter (2019), Martinez (2019), Alanazi (2020)]
– Unfolding [Datta (2018), Bellagente (2019)]
– Templates for QCD factorization [Lin (2019)]
– EFT models [Erbin (2018)]
– Event subtraction [Butter (2019)]
Event generators
– neural importance sampling [Bothmann (2020)]
– i-flow in SHERPA [Gao (2020)]
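The idea behind VEGAS-style and neural phase-space samplers is classic importance sampling: draw from a proposal q, weight by f/q. A plain (non-neural) sketch on a toy peaked integrand:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    # toy peaked "matrix element"
    return np.exp(-50.0 * (x - 0.5) ** 2)

# Proposal adapted to the peak; a trained network would learn this mapping.
mu, width = 0.5, 0.12
x = rng.normal(mu, width, size=200_000)
q = np.exp(-0.5 * ((x - mu) / width) ** 2) / (width * np.sqrt(2.0 * np.pi))

estimate = np.mean(f(x) / q)     # MC estimate of the integral of f
exact = np.sqrt(np.pi / 50.0)    # analytic Gaussian integral for comparison
```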
What is new and cool and fun?
Super-resolution (preview)?
Getting inspired [Blecher, Butter, Keilbach, TP + Irvine]
– take high-resolution calorimeter images; down-sample to 1/8th 1D resolution; GAN inversion
– start from low-resolution calorimeter images; GAN high-resolution images
– works because GANs learn structure [showers are QCD]
– energy of constituents no. 1, 10, 30
⇒ GANs are (kind of) magic
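The down-sampling step above is block averaging by a factor 8 per dimension; a numpy sketch of the degradation a super-resolution network would try to invert (image size illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# High-resolution 64x64 toy calorimeter image.
hi_res = rng.exponential(1.0, size=(64, 64))

# Down-sample by a factor 8 in each 1D direction via block averaging.
factor = 8
lo_res = hi_res.reshape(64 // factor, factor,
                        64 // factor, factor).mean(axis=(1, 3))
# lo_res is 8x8; block averaging preserves the total energy up to the
# constant factor**2 from averaging instead of summing.
```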
What about MC-inversion?
Unfolding as inversion [Bellagente, Butter, Kasieczka, TP, Rousselot, Winterhalder]
– network as bijective transformation — normalizing flow
  Jacobian tractable — normalizing flow
  evaluation in both directions — INN [Ardizzone, Kruse, Rother, Köthe]
– building block: coupling layer

  x_d ∼ g(x_p)  with  ∂g(x_p)/∂x_p = ( diag e^{s2(x_p,2)}   finite
                                        0                   diag e^{s1(x_d,1)} )

– padding by yet more random numbers

  (x_p, r_p)  —— Pythia, Delphes: g →
              ← unfolding: ḡ ——  (x_d, r_d)
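A minimal numpy sketch of such a coupling layer, with a fixed toy "subnet" instead of a trained one: the untouched half drives an affine map of the other half, the Jacobian is triangular, and the inverse is exact by construction (all shapes and weights illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "subnet": maps the untouched half to per-component scale s and shift t.
Ws = rng.normal(scale=0.3, size=(2, 2))
Wt = rng.normal(scale=0.3, size=(2, 2))

def forward(x):
    x1, x2 = x[:2], x[2:]
    s, t = np.tanh(Ws @ x1), Wt @ x1
    y2 = x2 * np.exp(s) + t            # affine transformation of x2
    log_det = np.sum(s)                # triangular Jacobian: log|det| = sum s
    return np.concatenate([x1, y2]), log_det

def inverse(y):
    y1, y2 = y[:2], y[2:]
    s, t = np.tanh(Ws @ y1), Wt @ y1   # same subnet, evaluated on y1 = x1
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

x = rng.normal(size=4)
y, log_det = forward(x)                # inverse(y) recovers x exactly
```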
⇒ proper sampling
[Figure: normalized pT,q1 distribution (10–50 GeV) for a single detector event with 3200 unfoldings, INN vs FCGAN vs parton truth; calibration: fraction of events vs quantile pT,q1 for INN and FCGAN]
Conditional INN
Even more random sampling: conditional network
– same as Anja’s FCGAN [Omnifold]
– parton-level events from random numbers
– calibration for statistical unfolding
[Figure: normalized pT,q1 distribution (10–50 GeV) for a single detector event with 3200 unfoldings, cINN vs INN vs FCGAN vs parton truth; calibration: fraction of events vs quantile pT,q1 for cINN, INN, FCGAN]
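The calibration curve referred to above can be sketched as a quantile-uniformity check: unfold each event many times and ask at which quantile of the posterior the truth sits; for a calibrated unfolding the quantiles are uniform, so the cumulative curve follows the diagonal (toy smearing model, illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(8)

n_events, n_unfoldings = 500, 3200
truth = rng.normal(25.0, 5.0, size=n_events)          # toy pT,q1 truth (GeV)

# Toy detector smearing and a posterior of matching width around the
# measured value -- i.e. a calibrated unfolding by construction.
width = 3.0
measured = truth + width * rng.normal(size=n_events)
posterior = measured[:, None] + width * rng.normal(size=(n_events, n_unfoldings))

# Quantile of the truth within each event's posterior sample.
quantiles = np.mean(posterior < truth[:, None], axis=1)

# Fraction of events below each quantile value: should follow the diagonal.
grid = np.linspace(0.0, 1.0, 11)
fraction = np.array([np.mean(quantiles <= g) for g in grid])
```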
Unfolding extra jets
– detector-level process pp → ZW + jets [variable number of objects]
– parton-level hard process chosen 2 → 2 [whatever you want]
– ME vs PS jets decided by network [including momentum conservation]
[Figures: (1/σ) dσ/dpT,q1 in GeV⁻¹ (20–140 GeV) for 2-, 3-, 4-jet events; (1/σ) dσ/d(Σpx/Σ|px|) showing momentum balance at the 10⁻⁴ level for 2-, 3-, 4-jet events]
⇒ proper statistical inversion!
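The momentum-conservation observable Σpx/Σ|px| used above is simple to evaluate on generated events; a toy sketch where balance is enforced by hand (event counts and scales illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy parton-level events: the summed px over all final-state objects,
# normalized by sum |px|, should peak sharply at zero.
n_events, n_partons = 1000, 4
px = rng.normal(0.0, 50.0, size=(n_events, n_partons))
px[:, -1] = -px[:, :-1].sum(axis=1)        # enforce exact balance here

balance = px.sum(axis=1) / np.abs(px).sum(axis=1)
# For a generative network, the spread of `balance` measures how well
# momentum conservation has been learned.
```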
Outlook
Machine learning a great tool box
– LHC physics really is big data
– jet classification was a starting point
– generative networks exciting for theory
– physics questions: errors, precision, control, theory insight
What is new and cool and fun?