hpc and data management for hep - oak ridge …...demo • using existing art framework-related...
TRANSCRIPT
UsingObjectstoreforHEPdataHEPnOS
HEPDatamanagement
HPCanddatamanagementforHEPSciDAC HEPDataAnalyticsonHPC
N.Buchanan2,S.Calvez2,P.Carns1,P.F.Ding4,M.Dorier1,D.Doyle2,A.Himmel4,J.Kowakowski4,R.Latham1,M.Paterno4,A. Norman4,S.Sehrish4,A.Sousa3,S.Snyder1,R.Ross1
1ArgonneNationalLaboratory, 2ColoradoStateUniversity,3UniversityofCincinnati,4FermiNationalAcceleratorLaboratory
NO𝜈AEventSelection
• EventsarestoredinaROOTn-tupleformat.• ~180thousandROOTfiles.• FilesizesrangefromafewhundredKiBtoa
fewMiB;thefulldatasetis~3TiB.• Stricteventselectioncriteriatoreduce
millionsoftheseseventstoafewtens.
UsingHPCforParallelEventSelection
FIG.1:Sciencecontextandtraditionalsolution
2018 Scientific Discoverythrough Advanced Computing (SciDAC-4)
Principle Investigator Meeting, Rockville, MDhttp://computing.fnal.gov/hep-on-hpc/
• ExperimentalHEPdealswithcomplexBigDatafromuniquedetectors.–Already~10sofPBforCMSatUScenters,expectedtogrowby2ordersofmagnitudeduringHL-LHCera–~17PB/yearforDUNE• Analysistasksandthereforeanalysissoftwareisalsocomplex.• Theanalysissoftwareiswrittenbythousandsofphysicists.•Weprovideamodularframeworkanddatamodeltosupporttheircollaboration.•WeareexploringthedesignofanobjectmodelthatleveragesHPCmachinefeatures.
Model(neutrino and anti-neutrino)
Event selection and data reduction
Neutrino candidates(background, predictions,
observations)
Neutrino events, background, predictions
Parameter estimation
Reconstruction output
Particle and event ID
Simulation output
Particle and event ID
n-dimensional contour plots
Data store: Tens of thousands of CAF files in dcache or tape
Data size:~TBs
ROOT macro run as tens of thousands of individual jobs
on grid
Data size: few tens of KBs
• Thensummarizethesetensofeventstobeusedtheextractionoftheneutrinooscillationparameters.
• Goal:reducethetimeittakestoprocessanalysis-leveldata.
MeasurementoftheneutrinooscillationparametersbytheNO𝜈Acollaboration,PRL118,231801(2017).• Sampleof~27millionreconstructedspillstosearchforelectron-
neutrinoappearanceevents.
CN CN CN CN
CN CN CN CN
CN CN CN CN
CN CN CN CN
srun -n 76800 shifter \ python process_nova.py novacaf.h5
High bandwidth global file system
Login Node
BB BB BB BB
Compute nodes connected Aries
high-speed network
novacaf.h5striped
High bandwidth burst buffers
MostofthecomputingresourcesthatwillbeavailabletouswillbeatHPCcenters.• ModifyourworkflowsandcodetotakefulladvantageofHPCsystems.• Minimizereading,communication andsynchronizationbetweenprocesses,parallelismis
implicit;user-writtencodelooksjustlikeserialcode.• Processalldataforagivenslice inasingleMPIrank,theslice isNOvA’s “atomic”unitof
processing,likeacolliderevent.• OrganizethedataintoasingleHDF5file,containingmanydifferenttables.
~20,000 files
art/ROOT
~20,000 files
HDF5
~20,000 art batch jobs on grid resources
~20,000 files
HDF5
Transfer files toNERSC using grid tools
1 file
HDF51 MPI job toconcatenate~20,000 files
SummaryandFuturework• NOvA istakingownershipofourHDF“ntuple”productioncode.• Wewillbedoinglarge-scaleperformancetestingofthecode.– SimilardesignprocessingLArIAT datademonstrateperfectscalingto76,800ranks;read anddecompress 42TBofdatain<20secondswall-clocktime.
• WewillbecomparingperformancewithDIYC++14implementation.• IntegrationwithlargerworkflowusingDecafthatisalsopartoftheSciDAC project.– useofchangesineventselectioncriteriatoevaluationsystematicuncertaintiesinthemixingparametermeasurements.
– oneintegratedMPIprogram,totakebestadvantageofHPCplatform.
def kNueSecondAnaContainment(tables):df = tables['sel_nuecosrej']return (df.distallpngtop > 63.0) & \
(df.distallpngbottom > 12.0) & \(df.distallpngeast > 12.0) & \(df.distallpngwest > 12.0) & \(df.distallpngfront > 18.0) & \(df.distallpngback > 18.0)
def vtxelasticzCut(tables):df = tables['vtx_elastic']df['good'] =
(df.vtxid == 0) & (df.npng3d > 0)KL = ['run', 'subRun', 'event', 'slice']return df.groupby(KL)['good'].agg(np.any)
Listing1.Selectiononmultiplecolumnsofatable
Listing2.Groupby andAggregationexampleDistributingandreadinginformation• Eachrankreadsits“fairshare”ofindexinfofromeachtable.–identifieswhichrankshouldhandlewhichevent,formostevenbalance–identifiesrangeofrowsintablethatcorrespondtoeachevent(allslices)
• Event“ownership”informationdistributedtoallranks.–thisassuresnofurthercommunicationbetweenranksisneededwhileevaluatingtheselection.criteriaonaslice-by-slicebasis.–perfectdataparallelisminrunningallselectioncode.
• Eachrankreadsonly relevantrowsofrelevantcolumnsfromrelevanttables.–allrelevantdatareadbysomerank.–norankreadsthesamedataasanother.
sel_nuecosrej vtx_elastic
rank 0
rank 1
Index info read by rank 1Table row read by rank 0
run subRun event slice distallpngtop ...35 more…
16433 61 356124 35 nan16433 61 356124 36 -0.7401374616433 61 356124 37 nan16433 61 356125 1 nan16433 61 356125 2 423.633716433 61 356125 3 -2.849864
run subRun event slice vtxid npng3d …6 more…16433 61 356124 35 0 016433 61 356124 36 0 116433 61 356124 36 1 116433 61 356124 36 2 516433 61 356125 1 0 116433 61 356125 3 0 0
Table2.Thistablehasoneentrypervertex.Somesliceshavenone(anddonotappearinthistable).Somesliceshavemanyvertices.
Table1.Thistablehasoneentryperslice.Wearedoingslice-by-slice selection.
hepnos::DataStore datastore(configfile);for (auto it = datastore.begin(); it != datastore.end(); ++it) for (auto const& ds : *it) for (auto const& r : ds.runs()) for (auto const& sr : r) for (auto const& e : sr) {
std::vector<recob::Hit> hits;art::InputTag hit_tag("gaushit", "", "PrimaryReco");e.load(hit_tag, hits);// .. do something with the vector of Hits ..
}
SummaryandFuturework• WeareexploringhowtoputsupportforMPIparallelismdirectlyintotheinfrastructure.–Willenableourreconstruction/analysiscodestotakeadvantageofdataparallelismanddistributedprogrammingwithoutlookingmuchdifferentthanitdoestoday.
• WewilldeploytoNERSC(throughShifter)andALCF(throughSingularity)andperformscalabilitystudy.• Wewillbecomparingperformanceagainstcurrentmethodsofreadingandwritingdata.• IntegrationofobjectstorewithlargerworkflowutilizingDecafandenableitsuseatdifferentlevelsofprocessing.
Listing3.ExampleofHEPnOSuse:readingfromtheobjectstore
Theuserinterfacelooksalotlikeour
currentuserinterface
FIG.2:Workflowfortranslatingold-styledatatonew-styledata
FIG.3:Anexampleofdistributingandreading
Exampleofnew-styledataorganizationinHDF5
FIG.4:HPCsolution
FIG.6:CMSdetector
FIG.5:DUNE
FIG.8:HEPnOS
FIG.9:classrecob::Track
UsercodeExamples
HEPdataorganization● Acommondesignistohavelevelsofdataaggregation:dataset/run/subrun /event/dataproducts
● Ateachlevelofaggregation,thenameditemsareindependent.
● Datasetscancontainotherdatasets.● Distributionofdataautomatedbythedatamodelimplementation,notuser-visible.
art modules/usercodeAlgo 3Algo 1 Algo 2
Event“Proxy”
get<product>(key)
Source
Prepare“correct”proxy
HEPnOS C++APIload
Interactionwithproxy
InteractionwithHEPnOSUserpath
FIG.7:UserinteractionwithHEPnOS
BAKE SDS-KeyVal
HEPCode
RPC RDMA
PMEM LevelDB
C++API
FERMILAB-POSTER-18-097-CD
Acknowledgement:ThismaterialisbaseduponworksupportedbytheU.S.DepartmentofEnergy,OfficeofScience,OfficeofAdvancedScientificComputingResearch,ScientificDiscoverythroughAdvancedComputing(SciDAC)program,grant1013935.
FIG.10:classrecob::Hit
Goals● Managephysicseventdatafromsimulationandexperimentthroughmultiplephasesofanalysis
● Accelerateaccessbyretainingdatainthesystemthroughoutanalysisprocess
● ReusescomponentsfromMochi ASCRR&Dproject
Properties● Write-once,read-many● Hierarchicalnamespace(datasets,runs,subruns)● C++API(serializationofC++objects)
Components● Mercury,Argobots,Margo,SDSKV,BAKE,SSG● Newcode:C++eventinterface
MapdatamodelintostoresDemo• Usingexistingart framework-relatedsoftware:–gallery providesaccesstoeventdatainart/ROOTfilesfromoutsidetheart framework.–pre-existingbuildsoftheLArSoft dataproductlibraries.• UsingsomeofthedataproductclassesfromLArSoft(usedbyDUNE,etc.)– std::vector<recob::Hit>– std::vector<recob::Track>– art::Assns<recob::Hit, recob::Track>,whicharebidirectionalassociationsbetweenhitsandtracks.
• UsingDockercontainersforeasyportabilityofdevelopmentenvironment• Prototypetestprogramsarealreadyrunningto:– readfromexistingart/ROOTdatafiles,andtowritetothenewdatastore– readfromthenewdatastore,andverifytheintegrityofthedata