HSF Generator Workshop, 26-28 November 2018
This live notes document is available to any participant in the workshop
to write down important points or conclusions from the workshop and to
identify areas where we will want to follow up or need to resolve open
questions.
Please use it - it’s an essential tool for the workshop’s memory!
The Organisers
Monday AM
CWP Challenges and Workshop Aims (Liz Sexton-Kennedy)
- Generators critical part of what experiments do. Aware that
“reward system” has limitations, need to improve on this. HSF
CWP suggests reviewing software credits. HSF CWP has a section on
generators and one on training and careers.
- Estimated CPU computing resources needed for HL-LHC exceed those
that are affordable on a “flat budget”. Pileup dramatically
increases complexity. Need major reengineering of the software in
all areas to be able to fully exploit HL-LHC potential. Especially
need to exploit parallelism and vectorization of modern
architectures, and GPUs are even more challenging.
- Physics of interest at HL-LHC is at smaller cross sections and
higher multiplicity.
- Generators CAN be computationally intensive. Larger consumption in
ATLAS than CMS. ATLAS US wall clock budget in 2016 had around 20%
spent on generators.
- NLO and beyond-NLO pose large challenges. The theory community
  continues to push development to higher orders. HL-LHC will drive
  the need to generate events and make predictions that are accurate
  at higher order.
- Recap from Run1 to Run4.
- Run1 mainly LO, O(100ms).
- Run2 moved to NLO with O(10-100s) execution time.
- Total grid time in an experiment depends on the mixture of LO and
  NLO required by analysts.
- ATLAS/CMS create O(10B) events per year, rule of thumb is MC =
1-2x data events.
- Run4 will collect 10x as many events as Run2. More precision will
  be needed, hence higher orders. There will be more particles, with
  more phase space.
- All within a flat budget…
- Modernization means: many core, SIMD vectorization, NUMA memory
hierarchy, offload to GPU/FPGA/TPU accelerators.
- Need to inject engineering help. A move towards Open Source
  development is one way to achieve this.
- Agenda of the workshop explained:
- Introduction to generators for this mixed audience
- Experiment needs
- Technical requirements on modern codes
- Status of main generators. Can we
agree on a way to benchmark them?
- Ongoing work on HPCs
- Event reweighting? Share samples between ATLAS and CMS?
- Status of generators used as “decayers”
- NNLO status
- Follow-up planning
- Hackathon
- C/ZachMarshall: the reference list in CMS is a twiki, as it is in
  ATLAS. There is also a common twiki shared between the experiments,
  to which theorists can contribute
- C/Zach about HL-LHC: collect 10x events, do we need 10x MC events?
It depends on balance between statistical and systematic errors.
In many cases we are limited by statistics and we do not need
that.
- C/StefanRoiser: for LHCb please get in touch with Patrick
Koppenburg for reference lists
- C/StefanoFrixione: not sure how many people actually read the
  references, e.g. the name of the MadGraph generator
- C/StefanoFrixione: not sure what the needed precision at HL-LHC
will be. A/Liz: hope that we get an answer to this question from
the workshop.
- C/AndyBuckley: generator codes are already “open” — but these are
not things that can just be dipped into and have significant
improvements made, as many have discovered. Impossible to do
without close collaboration & discussion with authors… who will
approve a huge & invasive pull request that hasn’t been previously
discussed? Code hosting systems have also moved toward modern
web-driven interfaces, but I don’t think that’s been a real
limiting factor.
Introduction to Generator Codes (Simon Platzer)
- Introduction to generators for a mixed audience.
- Factorization at different energy scales: hard process
calculation, parton shower algorithms, multiple interaction
models, hadronization models. Event weight comes from hard cross
section.
- Hard process is exact calculation in QCD/QED perturbation
theory.
- Most developments and computing power requirements sit in
hard process calculation and parton shower algorithms.
- Hard process calculation and parton showers.
- Jacobian factor to map the unit hypercube of phase space to
  physical momenta (a toy sketch follows this list).
- At high momentum transfer, expand QCD at LO, NLO, NNLO and so
  on. Real and virtual contributions are separately divergent and
  cancel; this introduces subtraction terms which lead to negative
  weights, and this is unavoidable.
- Landscape of infrared sensitive observables. At every fixed
order of expansion, need more and more logarithm terms.
- Factorization of emissions. Resummation and parton showers.
NLO matching.
- This is all state of the art for all major generators. In
order to produce infrared safe predictions, negative weights
are unavoidable.
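A toy sketch (not from the talk) of the phase-space mapping idea above: a uniform random number on the unit hypercube is mapped to a physical variable, and the Jacobian of the map becomes part of the event weight. The resonance mass, width and integrand are invented for illustration.

```python
# Toy phase-space mapping: r in [0,1] -> virtuality s via a Breit-Wigner map.
# The Jacobian |ds/dr| turns the unit-hypercube integral into one over s.
import math
import random

M, W = 91.2, 2.5                      # made-up resonance mass and width (GeV)
s_min, s_max = 10.0**2, 200.0**2      # integration range in s

y_min = math.atan((s_min - M*M) / (M*W))
y_max = math.atan((s_max - M*M) / (M*W))

def sample_s(r):
    """Map r in [0,1] to s, returning s and the Jacobian |ds/dr|."""
    y = y_min + r * (y_max - y_min)
    s = M*M + M*W * math.tan(y)
    jac = (y_max - y_min) * M*W / math.cos(y)**2
    return s, jac

def toy_me2(s):
    """Toy 'matrix element squared': a Breit-Wigner peak."""
    return 1.0 / ((s - M*M)**2 + (M*W)**2)

n = 100_000
total = 0.0
for _ in range(n):
    s, jac = sample_s(random.random())
    total += toy_me2(s) * jac          # event weight = integrand * Jacobian
print("MC estimate:", total / n, "analytic:", (y_max - y_min) / (M*W))
```

Because the map matches the peak, the weights here are flat; a poor map gives a broad weight distribution and inefficient unweighting, which is the connection to the unweighting costs discussed later in the workshop.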
- Event generator structure is based on the factorization discussed
at the beginning
- Challenges
- Negative weights spoil probabilistic interpretation
- Q/AndreaValassi: has the issue of negative weights hit us already
  or not, and how much? A/Liz and SimoneAmoroso: it has hit us
  already, but only where NLO is needed. For instance, some SUSY
  analyses went back from NLO to LO because of this. Simone: note
  that where NLO is needed and 40% of events have negative weights,
  we need 25x more events to generate! (With unweighted events of
  weight +/-1 and negative fraction f, the effective statistics scale
  as (1-2f)^2, so f = 0.4 gives (0.2)^2 = 1/25.) If you can stay at
  LO, you have no negative weights.
- C/Zbiszek: note that at LEP times the problem of negative
weights was already there for QED but was eventually solved.
- C/StefanoFrixione: disagree, IR is only one part of the
problem leading to negative weights, and it is very difficult
to design a catch-all solution.
ATLAS Needs and Concerns (Josh McFayden)
- Generation is 20% in projected 2028 needs for ATLAS. But the slice
will become more important as improvements come in fast detector
simulation and elsewhere.
- Majority of CPU comes from the Sherpa 2.2 V+jets setup, by far the
  largest and most precise samples.
- Average CPU/event: 80s evgen, 240s fullsim, 45s fastsim, 60s
  reco.
- Thanks for all recent improvements to the theory community!
- Problems!
- Negative weights. If the fraction is >25%, using full
detector simulation becomes impractical.
- Precision vs CPU. More precision needs more CPU.
- Efficiently populating extreme regions of phase space.
- Possible routes out
- Can experiments help in generator development? Investigating
funding of positions by the experiments.
- Sacrifice speed for modelling/precision? Choose faster
generator? Reweight lower precision to higher precision?
- PRELIMINARY benchmarking
- W+jets (2jets NLO and 4jets LO). Sherpa vs MG5_aMC+Py8.
Larger CPU and memory consumption for sherpa, marginally
larger fraction of negative weights for MG/aMC.
- Note that the fraction of negative weights alone is not a
  sufficient metric; you also need to know the spread of the weight
  values to understand their impact. A better metric for negative
  weights needs to be designed (one possibility is sketched after
  this list).
- ttbar comparison, also including Herwig and powheg
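One possible metric of the kind asked for above (an assumption, not something shown in the talk) is the statistical effective sample size, N_eff = (sum w)^2 / sum w^2, which folds in both the negative fraction and the spread of the weights:

```python
# N_eff / N for a set of event weights: 1.0 for uniform positive weights,
# (1 - 2f)^2 for +/-1 weights with negative fraction f, lower for a broad spread.
import numpy as np

def weight_efficiency(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum()**2 / (len(w) * (w**2).sum())

rng = np.random.default_rng(1)

w_pm1 = rng.choice([1.0, -1.0], size=100_000, p=[0.6, 0.4])      # 40% negative
print("+/-1 weights, 40% negative:", weight_efficiency(w_pm1))   # ~0.04 -> need ~25x events

w_broad = w_pm1 * rng.exponential(1.0, size=w_pm1.size)          # same fraction, broad spread
print("same fraction, broad spread:", weight_efficiency(w_broad))  # lower still
```

A scalar like this, combined with the relative CPU cost of detector simulation versus generation, could serve as the figure of merit asked for here.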
- Infrastructure
- Common computing infrastructure for ATLAS/CMS?
- Will generators run on CPUs or GPUs in 2028?
- ATLAS and CMS sharing samples?
- Summary
- Many improvements in recent years
- Certainly low-hanging fruit: event generation has not been
  highly scrutinised until now.
- Optimisation can help; how to do it has to be discussed
- Q: on slide 4, how much of generation is on the grid? A: we do use
  the external Sherpa pre-integration
CMS Needs and Concerns (Efe Yazgan)
- Most LHC events are accompanied by hard jets, which are required
or vetoed in many analyses, hence need MC predictions for them.
- Most commonly used in CMS: MG5_aMC@NLO+Py8
- CMS software is C++ driven by python. MC production submitted to
grid resources, software available via cvmfs. External generators
called from CMSSW through an externalLHEProducer module.
- Gridpacks: tarballs with precompiled code and initial phase
  space integration results. They are placed in cvmfs and
  accessed by remote jobs. In production, significant time is spent
  untarring the tarballs (an illustrative configuration fragment
  follows).
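For illustration only, a sketch of the kind of CMSSW fragment that drives a gridpack through ExternalLHEProducer; the gridpack path, event count and script name are placeholders, and parameter details may differ between CMSSW releases:

```python
# Hypothetical CMSSW configuration fragment (paths and numbers are placeholders).
import FWCore.ParameterSet.Config as cms

externalLHEProducer = cms.EDProducer("ExternalLHEProducer",
    # gridpack tarball on cvmfs; the wrapper script untars and runs it
    args = cms.vstring('/cvmfs/cms.cern.ch/phys_generator/gridpacks/.../gridpack.tar.xz'),
    nEvents = cms.untracked.uint32(5000),
    numberOfParameters = cms.uint32(1),
    outputFile = cms.string('cmsgrid_final.lhe'),
    scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh'),
)
```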
- In Run2, generators were 1-10% of total CPU (per MC event,
depending on type of MC event - not a % relative to full grid
budget). Multi-leg NLO has negative weights up to 40%, larger
samples needed.
- Beyond Run2: much larger samples, larger alternative samples for
systematic studies, precise tails of distributions need more
precise predictions (NLO, hence negative weights)
- Table with use of event weights for systematic studies
- MG5_aMC bias weights to reweight LO to NLO (see O. Mattelaer's
  arXiv paper)
- Needed developments
- Timing comparison and understanding in ATLAS and CMS
- Reduction of negative weights at NLO
- Multi threading, vectorization, GPUs, reduce memory
consumption
- Faster phase space integration (neural networks, GPUs:
10-5000x faster possible?)
- Collaboration ATLAS-CMS, support positions
- C/Servesh: estimates of 10-5000x faster depend on how you compute
  this. C/Peter: in that paper, the speed-up was for things that are
  already fast!
- Q/slide 7: why the difference in gridpacks between MG and sherpa?
A: not sure
- Q/AndreaValassi slide 12: do you actually see problems fitting
jobs in 2GB per core for generation? A/Liz: in CMS we do share
geometry in multi threaded framework
- C/StefanoFrixione: not sure it's a good idea to use the same
  generator-level events between experiments. It is good that the
  experiments cross-check each other. A/Graeme: this will be
  discussed in more detail in tomorrow's discussion.
LHCb Needs and Concerns (Phil Ilten)
- LHCb was designed as a B-physics experiment. Decays are the most
  delicate part and go through EvtGen; ~5000 signal decays are
  handled. LHCb then expanded its physics programme and uses multiple
  general-purpose generators. 90% goes through multi-parton
  interaction (NOT message passing interface MPI!).
- All MC production is done centrally. Every job selects a model and
an event type (i.e. a decay description).
- GAUSS internals description. Includes pileup typically in pythia8
and decay typically in EvtGen, plus ME and shower for specific
generator. Pileup tool generally included at fixed luminosity due
to luminosity levelling in LHCb running. Large number of
generators for general purpose, soft and hard processes.
- Recent developments to cope with Run3: filtered events, multiple
trigger conditions, redecay, particle gun events. Gaussino in
progress to be experiment-agnostic, for other experiments and also
theorists. Also looking at options for faster detector
simulations, including GAN and DELPHES.
- Q/Liz on thread safety: did you discuss this for EvtGen?
A/MichalKreps: will discuss this tomorrow
- Q/StefanoFrixione: are you using Pythia6? A: no, working on
eliminating it
- Q/AndreaValassi: no problem with negative weights? A: not now but
we will have a problem eventually
- Q/Servesh: how do you load C++ libraries? A: typically compile
time rather than dynamic. C/Servesh: good, compile time is much
faster.
- Q/SimoneAmoroso: so basically you mostly run Pythia8 standalone? What
  is your event generation budget? A/MichalKreps: for many cases it
  is not a limiting factor, as Geant4 is much more CPU intensive. But
  if the user imposes cuts to save time on Geant4, this puts more
  pressure on generation. A/GloriaCorti: yes, in these cases
  generation is the dominant budget, and the event generation
  fraction will get worse as we go to faster detector simulation. We
  don't run Pythia8 standalone but together with detector simulation.
- N.B. MPI in this talk means multi-parton interactions!
Practical Computing Considerations (David Lange)
- MC production (generators plus detector simulation) is the major
consumer of CPU on the grid. Requests come in bursts, driven by
analysis deadlines like conferences. Gridpacks are a major
enterprise for non experts.
- Two approaches between experiments:
- CMS-like: gen+simulation, pileup+digi+trigger, reco. Good: the
  generator is a small part of gen+sim. Bad: generators change
  frequently, whereas you want Geant4 to stay stable across years.
- ATLAS like: also separate gen from simulation. Allows more
flexibility in generator software.
- Grid runs 500k+ cores across 100 sites. Job slots 1-8 cores and
1-2 days maximum wall clock time. HPC resources are spiky.
- Aggressive code optimization of reco led to very large
improvements in time/event.
- Code should be robust enough to run billions of events. Startup
  time should be minutes, not hours. Code should be runnable as a
  library in the experiment-specific framework.
- Common wisdom “generators small compared to the rest”, this is
incorrect. Examples: gridpack untarring, move from LO to NLO and
NNLO, filters with low efficiency for selecting events.
- Examples of coding practices that enabled reco software
improvements.
- Examples of recent generator improvements in CMS. Had one NLO
  generation workflow where 99%+ of the time was the initialization
  of the NLO generator. Got a 3x improvement by improving specific
  functions and by doing the initialization exactly once.
- Software licenses
- HEP converged on open source, but there are many such licenses
- Experiment software becomes GPL by inheritance if it uses GPL
components (although “derivative work” interpretation is a
very open discussion). And GPL2 and GPL3 are not even
compatible with each other.
- Coming up now because experiments are trying to solidify their
situations. CERN recently chose Apache2 to ease collaboration
with industry.
- We have to improve on software citation.
- C/StefanRoiser: for LHCb licensing was a long and complex
business. Chose GPL3 in the end.
- C/StefanRoiser on workflows: LHCb is spending 70 to 80% of
resources only in Geant4, will keep generators together because
this makes it easier to run on unpledged resources without input
data.
Discussion
- Q/PhilIlten: how efficient are ATLAS/CMS in setting generator
level cuts for specific users? A/JoshMF: depends on the sample,
various approaches.
- A/Zach: the closer to the ME you can put the cuts, the faster
  the generation gets. Can some hooks be added to generators in a
  common standard way to set generator-level cuts?
- C/VitalianoCiulli: CMS some time ago was doing something like
  redecay, though it was more a reshowering. We had to do something
  inside Pythia to achieve that, but maybe it was not documented even
  if it is in principle reusable. A/PhilIlten: working with
  PeterSkands to do this in a more reusable way.
- C/StephenMrenna about inheritance.
- Q/JoshMF: did not really manage to understand yet why the ATLAS
  and CMS budgets for generation are so different. A/Graeme: very
  difficult to answer this question here, but we should identify at
  the workshop ways of working on this, by putting the relevant
  people together.
Monday PM
- Technical optimization is the focus; it cannot be an afterthought
  once the code is written
- Moore's Law is still working after 42 years, but Dennard scaling
  was stopped by power consumption.
- GPU FP performance for non-terminated lines is either flat or
  dropping
- Post free lunch: architectural performance is still improving, but
  there is a danger of lock-in to a specific architecture; note the
  Xeon Phi cautionary tale
- Roofline performance comparison: represents how well an algorithm
  is suited to a specific architecture; defines the concept of
  operational intensity (a toy estimate is sketched below)
- DGEMM is the Linpack benchmark
- Recommend finding the flops/byte for your algorithms
- Cautionary tale of the Chinese supercomputer: a top entry at
  SC but not useful for science. On the other hand, some SC
  sites take input on which kernel benchmarks to use
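A toy estimate (all numbers invented) of the roofline idea above: attainable performance is the minimum of the compute roof and of operational intensity times memory bandwidth.

```python
# Toy roofline: attainable GFLOP/s = min(peak flops, flops-per-byte * bandwidth).
def roofline(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

peak, bw = 1000.0, 100.0       # hypothetical: 1 TFLOP/s peak, 100 GB/s memory bandwidth

# Example kernel: y[i] += a * x[i]  -> 2 flops per element,
# 3 double-precision accesses (load x, load y, store y) = 24 bytes moved.
intensity = 2.0 / 24.0
print("operational intensity:", round(intensity, 3), "flop/byte")
print("attainable:", roofline(peak, bw, intensity), "GFLOP/s  (memory bound)")
print("ridge point:", peak / bw, "flop/byte needed to reach the compute roof")
```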
- Vector Operation discussion
- 60% of the compiler implementation is devoted to loop optimization
so it is wise to allow the compiler to understand your loops
- Algorithms can be classified into 4 types, figure out which one
yours is suited to
- Ideal programs have computational kernels - HEP codes don’t have
them, why? Maybe too many years of optimizing hot spots?
- Discussion of the topic of the cost of virtual functions and
templates
Optimising Memory Use (Sebastien Ponce, CERN)
- The memory problem - relative to CPU speeds memory access is
extremely slow (100x), mitigated by hierarchical cache structure
- Make sure stack variables are <1 kB and stay within the current
  scope; their allocation is almost free
- Big objects go on the heap but they are very costly to allocate
- Pros and cons of struct-of-arrays vs array-of-pointers: the latter
  scatters the memory, creates many small allocations and will
  thrash the cache
- Contiguous memory blocks must be used to limit the thrashing
- When using a vector of structs make sure you use “reserve”
- Array of Structures (AoS) vs Structure of Arrays (SoA): which is
  best depends on the algorithm applied to the data. SoA would be bad
  for a sort; SoA is best for vectorizable algorithms
- If the consumer of your data has different algorithms there is no
  single best answer (no magic); a minimal illustration follows
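The talk's context is C++, but as a minimal analogy (an illustration, not code from the talk), here is the SoA idea in NumPy: one contiguous array per field supports vectorized, cache-friendly operations, while an array of objects forces pointer chasing.

```python
# AoS vs SoA, illustrated with Python/NumPy (the real discussion is about C++).
import numpy as np

n = 1_000_000

class Particle:                       # "array of structures": scattered objects
    __slots__ = ("px", "py", "pz", "e")
    def __init__(self, px, py, pz, e):
        self.px, self.py, self.pz, self.e = px, py, pz, e

aos = [Particle(0.1, 0.2, 0.3, 1.0) for _ in range(n)]
pt_aos = [(p.px**2 + p.py**2) ** 0.5 for p in aos]        # chases a pointer per object

soa = {name: np.full(n, val) for name, val in             # "structure of arrays"
       (("px", 0.1), ("py", 0.2), ("pz", 0.3), ("e", 1.0))}
pt_soa = np.sqrt(soa["px"]**2 + soa["py"]**2)             # contiguous, vectorizable
```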
- Profiler tools, 2 types:
- VTune from Intel: uses internal processor counters; gprof is
  in the same category
- Callgrind: open source, simulates a processor; the downside
  is that this is slow
- Don't share data across cores, spread them across threads: another
  reason to use stack variables
- Is deep inheritance expensive? Yes, if it is virtual. Avoid
  multiple inheritance. Do use the initializer list with empty
  constructor bodies; this removes initialization-ordering fragility
- Branch prediction and look-ahead should be automatic with good
  code. Why don't we have good code? Servesh says we are
  overwhelming the front end with lots of instructions. If we moved
  the code to be SIMD-efficient it would alleviate this problem
  automatically.
- How do you strike a balance between reliability and optimization?
A: With the tools available today, there is no longer a conflict.
You can have fast and readable code.
- Don’t be afraid of specializing templates
- Don’t be afraid to rewrite your prototype code
- Understand the limitations of the abstraction and concept
programming you do
- Shouldn’t you code for the common case first not the corner one?
Yes but remember the memory wall and think about how your
algorithms use the data first.
- Do code reviews help in readability? Experts think yes.
Sherpa Status and Plans - Marek Schonherr, CERN
- The framework:
- Available physics simulations
- Changing scales and PDFs / alphaS(massZ)
- Code structure diagram - modular component structure loaded
dynamically - same technology used for user definable functions
- Run Strategy 1. Initialization, 2. Integration, 3. Event
generation
- The first two steps can be done once and results are platform
independent
- Step three is done on the grid
- The second step is MPI parallelizable
- Code performance tables
- Confident that performance improvements of a factor of 10 are
  possible for the worst-case NLO sample
- There are places where small compromises in physics
performance can lead to major technical performance
improvements
- Walk through callgrind plots - at LO unweighting procedure
dominates time - at NLO also dominated by unweighting and matching
- Weight distributions and how they relate to performance - need a
better approximation of integrand
- If we have to fall back to LO because of performance it is better
to use the HT prime approximation
- Q: Can there be a way to use static linking in production, when the
  flexibility is not needed? A: it can be done, but Taylor found
  that he only gained 20% (doesn't sound bad to Markus)
aMC@NLO Status and Plans - Olivier Mattelaer
- The uncertainty @NLO: it is possible to trade speed for disk
  space, allowing reweighting with a future PDF
- Limitations of the re-weighting
- Re-weighting for mass scan
- NLO re-weighting
- MadGraph is compiled with -O1; it is not much better at a higher
  optimisation level (tested at -O3)
- -Ofast makes it 30% faster at LO/NLO -> however it requires
  validation since it changes the behaviour of the math functions
- MPI strategies
- Timing results and resulting mpi scaling as a function of the
number of ranks - there is a point where you lose by adding ranks
- LO doesn't scale to higher than 200-500 ranks; however it doesn't
  need high-bandwidth communication so it can run on a Tier-2
- HTC vs HPC chart - argues that HTC is better for constant cost ->
started discussion of owned vs. borrowed resources
- GPUs - plots comparing CPU to GPU performance as a function of the
number of jets, also as a function of number of gluons in the
final state
- CPU vs. GPU - no clear winner but likely winner is GPU
- Q: Is the MG5 port to CUDA in 2011 still working? A: yes, could be
improved
- C: very good, we need any code running on GPUs to exploit
latest HPCs
- Q: Is MPI only for integration? Generation? Or both? A: it is only
for the integration
PYTHIA 8 Status and Plans - Philip Ilten
- History: the latest release will appear in March, another is planned
- Documentation - lots of references
- Decided to be user-friendly
- Focuses on everything but ME generation
- Pythia is thread-safe except for the adaptive maximum
- Special constructors for multiple instances that save time
- External pointers in the interface allowing lots of customization,
even hooks for Heavy Ion generation
- User defined hooks can interrupt and customize different parts of
the internal work flow of Pythia
- Supports both HepMC2 and HepMC3 and can read in LHEF, SLHA
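As a minimal usage sketch (assuming the optional Python bindings are enabled in the Pythia build; the input file name is hypothetical), showering events read from an LHE file looks roughly like this:

```python
# Shower LHE input with Pythia 8's Python interface (bindings must be built in).
import pythia8

pythia = pythia8.Pythia()
pythia.readString("Beams:frameType = 4")      # take beams and hard events from LHEF
pythia.readString("Beams:LHEF = events.lhe")  # hypothetical input file
pythia.init()

n_showered = 0
for _ in range(100_000):          # upper bound on events to process
    if not pythia.next():         # simplified: also stops on a single failed event;
        break                     # a production loop would check for end-of-file
    n_showered += 1

pythia.stat()
print("events showered:", n_showered)
```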
- News -
- HI support with collective effects and alternate models for
string fragmentation
- Developed by Leif Lonnblad and Christian Bierlich JHEP
1810 (2018) 134
- Can use gridpacks
- Full integration of DIRE and VINCIA parton showers
- Lots more see slide 11
- DIRE from Stefan**2
- VINCIA by Peter Skands and Nadine Fischer
- Shower Variations by Stephen Mrenna and Peter Skands
- Dark Matter Models by Nishita Desai
- Deuteron Production by Phil Ilten
- Merging algorithms are slow and they are working to improve them;
  this is almost a bug fix
- Pythia CPU consumption may be small but it is used everywhere, so
  the integral can be significant. Can you work on the MPI? A: Yes
- LHC specificity happened in the transition from P6 -> P8
Herwig Status and Plans - Peter Richardson, CERN
- Designed for hadronic physics
- Strong emphasis on accurate simulation of QCD
- Became unmaintainable so moved to C++ with Herwig7 - understanding
the code was of high value
- Tried to balance performance vs physics and maintainability
- They rely on externals much less because it turned out that the
interfaces to externals were more prone to error
- Structure chart
- Done with fortran conversion
- Improved NLO matching
- Work on uncertainties and reweighting
- Improvements in soft physics
- Recent work on Baryonic Colour Reconnection
- Future
- All of these are physics improvements - that’s what gets noticed
in the theory community and gets funding and jobs
- In Herwig the biggest speed/memory improvement would be to improve
NLO
POWHEG Status and Plans - Emanuele Re, CERN & LAPTh
- Status of the repository
- A legacy version in svn
- V2 also in svn but more modular components
- RES with just a few processes
- Powheg needs external parton shower
- Recent physics Applications
- W+W- very computationally expensive for decays at NNLO+PS
- More examples on slide 3
- Default output is unweighted events - at times weighted events are
needed
- Negative weights can appear despite the acronym in the name
- Can be limited through “folding” -> runs become longer as
number of calls to real matrix elements increases
- Folding costs time but for complicated processes it is a
balancing act
- Different steps can be parallelized but there is a synchronization
point between steps
- Can detect "bad" grids (unstable points) to avoid including them
  in the next step
- … more facilities on slide 5
- Some facilities come from interaction with experiments and this
will likely continue
- No effort to support new architectures or increased
parallelization
- NNLOPS reweighting works but it’s CPU intensive
- MiNLO merging for processes with light partons in the final state
- there is a proposal that suggests a possible avenue for further
progress
Discussion
- Josh: is the benchmarking exercise redone for each release? (a
  general audience question) A: it's useful; we hope that support for
  this will continue
- Can there be unit tests for physics validation?
- Yes, this is done for Herwig; it is made difficult by the
  multiple changes occurring per benchmarked release
- Validation discussion
- As complex for the generators to evaluate expected changes as
for the experiments
- High statistics are often needed - implying a large resource
pool
Tuesday AM
High-multiplicity multi-jet merging with HPC technology
- Traditional approaches to generating high-multiplicity jets do
  not scale.
- HDF5 is used as it’s optimal for parallel access on HPCs.
- Processing compressed HDF5 files is x2 faster than the
uncompressed files.
- Each jet multiplicity is done explicitly and separately, but they
do need to be added together (merging with CKKW-L).
- Scaling is good, but not perfect - still being investigated.
- Q. HSF is working on “ferries” to transform between data types -
HDF5 seems very useful for LHE data, can this be made available
as a library?
- Conversion of 65M event files from LHE XML to HDF5 took 17
hours (single thread). (Note: that’s 1kHz, ~1MB/s.)
- Added direct HDF5 output as a plugin for Sherpa, will be in
future releases.
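The exact HDF5 event layout used in this work is not given in these notes; as a toy sketch of why HDF5 suits LHE-like data, writing and reading parton-level records with h5py could look like this (all names and shapes are illustrative):

```python
# Toy parton-level event store in HDF5 (illustrative schema, not the real one).
import numpy as np
import h5py

n_events, max_particles = 1000, 8
rng = np.random.default_rng(0)

with h5py.File("events.h5", "w") as f:
    grp = f.create_group("events")
    grp.create_dataset("weight", data=rng.normal(1.0, 0.1, n_events))
    grp.create_dataset("pdg_id",
                       data=rng.integers(-6, 7, (n_events, max_particles)))
    grp.create_dataset("p4",            # (E, px, py, pz) per particle
                       data=rng.normal(0.0, 10.0, (n_events, max_particles, 4)),
                       compression="gzip")

with h5py.File("events.h5", "r") as f:  # parallel HDF5 on HPCs would use the mpio driver
    print("mean weight:", float(f["events/weight"][:].mean()))
```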
- SciDAC projects have to produce “open source” products.
- Q. What inter-rank communications is there? A. For particle level
more or less none. For the ME this is done using existing Sherpa
MPI implementation.
- Q. Is it time to update the LHE format?
- SP. Pythia has added direct support for reading these HDF5 files,
will be in a future release (once stabilized).
- FS. Can move the expensive parts (ME) to HPC, then just do the
  particle level within the experiment software. Good also for
  sharing MEs between experiments and with the theory community. How
  reusable are the MEs? A. Cuts could be tightened, but it would
  become a bit less efficient. However, there is quite some variation
  in what can be done at the particle level (blue in the slide), so
  the expensive ME generation (red) can be reused, saving a lot of
  time.
A novel workflow of generator tunings in HPC for LHC new physics searches
- Q. HPC specific part is generating MC events, which is trivially
parallel? A. Yes, no parallel communication, so HTC would also be
fine for this.
- Q. Do you need load balancing between the ranks? A. This is
handled by the DIY toolkit (also used in previous example). Phil
noted that imbalances between per-event times are a real issue, so
this kind of dynamic balancing is very useful.
- Advantage of load balancing on a cluster is that far less
bookkeeping is needed in the end (cf. really independent jobs +
merging on the grid).
- Q. Did you consider Bayesian optimisation method, which offers
advantages of priors, etc.? A. Yes, it’s been considered, but not
yet really explored.
Adaptive multi-channel integration with MPI
- Implementing MPI is not just a technical task - one needs to
understand the program’s purpose, so needs physics input.
- Use MPI features - they work very well and are written and
supported by experts (don’t reinvent this wheel, you will fail!)
- Shared cluster made benchmarking difficult.
- Hybrid parallelisation provides an extra speed-up (OpenMP + MPI).
- Q. Does the integration do adaptive scheduling? A. No. Would be a
future improvement.
- Q. How difficult is it to collect the grids? Minor grouping or
  just one big reduction? A. It's just done on the master, so that
  is a slow (serial) part at the end (see the sketch at the end of
  this section).
- OpenMP/MPI is also useful for NUMA architectures.
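A minimal mpi4py sketch (not the code discussed in the talk) of the pattern above: each rank integrates its own sample independently and the results are reduced onto the master at the end, which is the serial step mentioned in the answer.

```python
# Each rank integrates its own sample; results are reduced onto the master.
# Run with e.g.:  mpirun -n 8 python toy_integrate.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def f(x):                                    # toy integrand on [0, 1]
    return np.exp(-x * x)

n_per_rank = 1_000_000
rng = np.random.default_rng(seed=rank)       # independent random stream per rank
fx = f(rng.random(n_per_rank))

# The reduction onto rank 0 is the serial "collect the grids" step at the end.
tot_sum = comm.reduce(fx.sum(), op=MPI.SUM, root=0)
tot_sq = comm.reduce((fx**2).sum(), op=MPI.SUM, root=0)

if rank == 0:
    n = n_per_rank * size
    mean = tot_sum / n
    err = np.sqrt((tot_sq / n - mean**2) / n)
    print(f"integral = {mean:.6f} +/- {err:.6f}")
```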
Optimising Generator Usage
- 1 - Reduce CPU requirements by running large scale event
generation at LO
- FS. Q. If things are flat in phase space it’s easy, but if not
flat then each analysis needs to apply their own corrections
(per analysis). So wouldn’t you need N(N)LO at the detector
level anyway? So what is the gain?
- Can save a factor x10 in CPU. (FS was sceptical)
- No negative weights at LO.
- Generation itself is also faster.
- NLO merged you can do at the truth level.
- Stefano. Q: Generating 8 jets is technically impressive, but is
  it needed? Are there problems in the merging? A. Yes, the merging
  needs to be logarithmically as exact as the ME.
- FS - there was some work that showed this is under
control.
- Can use Holger’s work to do cross checks.
- StefanH - NNLO is coming, but will not be as efficient as LO
simulations.
- Stefano - not saying anything about NNLO. NNLO will be
doing a small number of processes only. LO can do
everything.
- NLO matched calculations do exist for inclusive processes.
MC limited, not ME limited.
- 2 - Reduce computing costs by sharing unweighted matrix-element
event samples between experiments
- Few input parameters.
- Q. Josh - what are technical hurdles for sharing files between
CMS and ATLAS? A. Liz, did not work in the past. Now can share
this in the same way that we do for OpenData, so technical
limitations not there anymore.
- Just the computational part - not the
showering/hadronisation, so storage requirements are
minimal.
- Q. Does it affect how experiments run jobs? A. Not much -
changes the “splitting” line a bit.
- Q. Don’t different generators do this already? A. In some
areas. Would want this to be completely open, feed back to
theorists.
- Want to share more widely, but LHCb are in a different region
of phase space, not clear real savings would happen.
- Can also share intermediate pieces of computation.
- Q. Who would actually do this? How to get credit for it. A.
Give DOIs to the data (CERN product - gives attribution).
Figure out how to actually fund people.
- Q. What about cross-checks? Experiments do this today
because they are independent. Past experience shows that these
checks are important. A. Yes, cross-checks are important. SH -
can do this with different pieces of samples. Q. But what
about different physics? A. Would have these differently for
each generator anyway, so that allows for checking across
different generators. This also alleviates the technical
difficulties. Josh - we don’t cross check at all at the
generator level today, so we’re not really losing much.
- Q. Generating event weights is fast for LO, not much saved
(detector simulation is the expensive bit). Different for
N(N)LO though. A. Parton level calculation only - can, e.g.,
combine samples.
- Q. Intermediate suggestion - common configuration of cards.
Share these as they are difficult to get right. A. Marek - for
Sherpa this is no-brainer. But traditionally ATLAS and CMS
could not agree on a common tuning for Pythia.
- Josh - scrutiny should actually go up! We would have two sets
of eyes on the setup. Simone - without this we may be
limited in N(N)LO computations and would lose physics. Frank
- having more theory input helps here too. Peter - beware of
diminished responsibility.
- Q. Andrea - does this help with correlation of systematics?
Yes. But the shower tunings also important.
- Economics of doing this? People are not being “paid” to do
this - just part of their physics analysis. What’s the
motivation to help the other experiment? Don’t want to lose
the analyst support. Josh - not sure this distinction is true.
MC convenors are analysts. Davide - everyone needs the common
samples (W+jets). Common samples will be managed “centrally”.
Zach - V+Jets, Single Top, and done - not so many samples.
- Simone - sociologically it would seem to work better to do a
common sample rather than share existing things. Get buy-in
from people who will use this.
- 3 - Other topics
- Technical RAs making “deep” contributions into MC teams.
- Zach - experiments are not dripping with software experts.
What person would do this job? Need a career path.
- Liz - another way to help would be to modernise software
practice, open up code to allow pull requests. A. Code has
always been open. Current version of HEPForge should
support direct pull requests. Andy - communication is
important, needs to be a conversation between main
developers and submitter. Pull requests out of the blue
are not desirable, given complexity of the systems and
potential side-effects.
- Servesh - need a dialog between engineers and physicists.
- Andy (on Vidyo): “Since time & question bandwidth is
limited, I may as well write here that I don’t think
pull-request GUIs are a silver bullet for getting
experiment RA/sw engineering contributions to MC
tools! They’re nice, but a minor effect IMHO.
- The sw/performance issues are *big*, and require
significant time investments. If this work isn’t
valued and rewarded, most will not bother… the
social & career incentive obstacles are much bigger
than the mechanism for providing patches.
- Plus, understanding these codes in sufficient detail
  to make significant changes is not something that
  comes without a detailed & lengthy conversation with
  the authors.”
- Better alignments to needs of experiments
- Ease of use is a big point to improve on.
- Andy - ATLAS are generating a lot of samples with heavy
filters - not very efficient. Generator side improvements
(biasing) would help a lot.
- Marek - had a student to work on common API, but problem
is that it’s only applicable to simplest processes.
- Josh B - analysis groups will do as little work as
possible (even if it’s wasteful). Need to push groups to
do this and also teach people how to use the generator
functionality properly. Zach - examples of issues from
ATLAS could be given back to theory people and used to
improve the documentation. Marek - need the documentation
there, but people also need to read it (!) to avoid
inefficient setups.
- Zach - would be great to have a “production mode” switch
for generators (like Madgraph), turn off all the bells and
whistles.
Photos and Tauola Status and Plans
- Q. Is there any plan for being thread-safe? A. Initialisation is
done in one place - common block. Can pass that to different
“workers”. Not too worried about speed, it’s fast.
- Liz - experiments can deal with pattern of initialisation serial,
but event loop needs to share data read-only. A. There is one
place where data would be shared (histogram); different random
number generators can be plugged in already.
EvtGen Status and Plans
- EvtGen written and maintained by experimentalists.
- Modernisation from Gerhard has now overwhelmed GitLab web
interface (!) - many improvements!
- Multi-threading is next, will require new interface to the random
number generator
- HepMC 3 transition will take a few more months and help with
threading issues.
- Pythia8 could be used to replace Tauola; but Photos functionality
is really needed, no current alternative.
- Profiling shows lots of low hanging fruit for investigation
- Dealing with HepMC files (used in Photos interaction) takes a lot
of time.
- Q. Z - which version of photos and tauola are used? A. Think it is
the latest one.
- Changing interface to “shared” object would be very hostile
for threading.
- Two postdocs will contribute to multi-threading conversion effort,
as well as some help from Michel.
- Q. Zach - How do you get the decay table?
- PDG tables cannot be used directly in a computer program
  (~disputed).
- Problem with branching fractions.
- Is this something HSF can help with?
Tuesday PM
NNLO / N^3LO calculations with NNLOJet
- NNLOJet is a framework for NNLO calculations using antenna
  subtraction: Fortran + OpenMP + Python, depends on LHAPDF.
- Three processing phases, but dominated by the random-number-seed
  driven "production" phase.
- Jumping from NNLO to N3LO increases CPU by 2-3 orders of magnitude.
- How does this then get used by experiments (e.g. to put 100k hours
  in perspective)? The experiments answer that they don't do this in
  production; it may come in via analyst-driven work. Sounds like this
  might evolve towards HL-LHC, e.g. with the precision needed in MC
  event generation. [reweighting + final theory comparisons as opposed
  to MC events]
- Q/Andrea: This is not unweighted. How will experiments use
  this? My mental model now is that we will do LO unweighted (ok),
  some NLO unweighted (problem with -ve weights, but ok-ish),
  plus complementary info from NNLO, e.g. some distributions with
  very fine binning for some specific analyses. Is this a
  correct model? A/Simone: yes, now this is what we can do, but
  in 10 years we will probably have the NNLO unweighted and
  this will be the bottleneck. C/Andrea: yes, but the question is
  whether the NNLO unweighted will be financially affordable at that
  time…
- Scales “badly” with # of particles. Something to follow
- HPC discussion suggests that this sort of calculation is a good
use case.
- CERN’s collaboration with PRACE may be useful here
NNLO calculations with Matrix
- Matrix requirements are LHAPDF and numpy (plus more dependencies
  inside)
- Parallelism is across processes, batch support built in
- Nice comparison of CPU vs uncertainty - bottom line - days @ NNLO
vs minutes @ NLO vs seconds @ LO
NNLO calculations with MCFM
- Linear scaling via MPI. 1% precision on the NNLO cross section in
  O(1 hour) on a desktop; CPU for processes with a jet is O(1000 hours)
Beyond Current Paradigms
- Fixed order calculations - cross sections, not really events
anymore. Or are we just doing “more of the same”?
- Large production runs will probably become inaccessible to
theorists.
- Andrea - will probably always need unweighted events with detector
simulation. Driven by physics.
- Lattice get specific grants to do specific calculations -
something like this probably becomes the norm. Would be infrequent
and require some community agreement.
- Accuracy of the calculations: if something takes 6 months to run
  then it is probably not debuggable, and hence not trustable. If
  parallelisable, one could easily foresee using 10k cores for some
  hours.
- HSF/LPCC group proposal - identify interested people to build on
the sort of discussions during this workshop.
- Andrea volunteered from the computing resources point of view
- Unfolding - experiments have very different needs in this area.
Not clear if increasing order helps - often harder to match to
data. Recent ATLAS paper (Hbb?) just used Pythia8, but with
sophisticated unfolding.
- Key point is to describe the data well - if LO does that then
could be enough.
- Should be testing unfolding in a number of different
scenarios.
- Should be careful about what we mean by “right” - uncertainties
always exist and we have to balance them. Recent changes towards
investment in understanding theory systematics
- Inclusive cross sections and unweighted events have both proved to
be useful in LHC analysis.
- Landscape changing: the era of ‘fast’ searches is maybe coming to an
  end.
- Back to negative weights: can we judge if we are happy with 20%
  negative weights? This introduces the discussion of reweighting
  schemes and their use for systematics. When are they robust enough
  to avoid sending events through the G4 simulation?
- Lots of theory advances - this workshop should focus on areas
where improving the codes’ engineering will help.
- Need people to work together on this problem.
- Testing important.
- If there’s a need there should be a reward!
- Codes are evolving all the time - not clear that the languages
used are optimal, but that’s the code we have. Be simple.
Take Home Messages / Actions / Planning
- Generator performance comparisons work
- Broaden this activity
- Understand differences between ATLAS and CMS - e.g. two things
to understand
- Comparison of sherpa and MG for the same process
- Are the two experiments producing very different types of
samples or configurations? Is ATLAS producing more or more
expensive events?
- LHCb is different from ATLAS/CMS as generation focuses on decays
- Regression tests (for physics results AND timing performance) and
best code practice
- The HSF can help with this (a lot of technical experts)
- Can the experiments help with their infrastructure for timing
regression?
- Projection of CPU needs for Run4
- The fractional budget of event generation will increase as
experiments move to fast simulation and faster reconstruction
- Model for use of LO, NLO and NNLO?
- Best ways to support technical work in generator field
- Experiments
- Engineers (how to fund this - grant applications?)
- Prospects for ‘permanent’ jobs (i.e. not just two year grants)
- Generators critical part of what experiments do. Aware that
“reward system” has limitations, need to improve on this.
- Physics improvements are what gets noticed in the theory
community and therefore gets funding and jobs. Example
speed/memory improvement in Herwig (matrix elements and NLO
integration) would be “many years of work for little
recognition”. (see Peter’s presentation)
- GPU ports / code portability? Are experiments interested in
using/validating, particularly Madgraph.
- Supercomputers/HPCs are going more and more towards GPUs.
- A dedicated performance workshop in the future (Spring?)
- Say 5 days of real hands-on work
- Negative weights are a problem (but only in NLO, there are none in
LO). These come from cancellations of real and virtual
contributions and are unavoidable. “Unweighted” events with
negative weights means that events have weight +1 or -1
essentially.
- Need better metrics to understand the impact of negative
weights. Example: scalar metrics integrating the knowledge of
the weight distribution and also the relative cost (in CPU
budget) of detector simulation vs generation.
- Can a common hook be added to generators so that we can set
generator level cuts in a standard way?
- HSF has Working Groups in important areas of work for HEP. This
area could benefit from the same.
- Some convenors willing to carry this forward and organise?
- Collaboration with LPCC has been very fruitful - should be
joint activity