


Using Object store for HEP data: HEPnOS

HEP Data management

HPC and data management for HEP
SciDAC HEP Data Analytics on HPC

N. Buchanan^2, S. Calvez^2, P. Carns^1, P.F. Ding^4, M. Dorier^1, D. Doyle^2, A. Himmel^4, J. Kowalkowski^4, R. Latham^1, M. Paterno^4, A. Norman^4, S. Sehrish^4, A. Sousa^3, S. Snyder^1, R. Ross^1

^1 Argonne National Laboratory, ^2 Colorado State University, ^3 University of Cincinnati, ^4 Fermi National Accelerator Laboratory

NO𝜈A Event Selection

• Events are stored in a ROOT n-tuple format.
• ~180 thousand ROOT files.
• File sizes range from a few hundred KiB to a few MiB; the full dataset is ~3 TiB.
• Strict event selection criteria reduce millions of these events to a few tens.

Using HPC for Parallel Event Selection

FIG. 1: Science context and traditional solution

2018 Scientific Discoverythrough Advanced Computing (SciDAC-4)

Principal Investigator Meeting, Rockville, MD
http://computing.fnal.gov/hep-on-hpc/

• Experimental HEP deals with complex Big Data from unique detectors.
  – Already ~10s of PB for CMS at US centers, expected to grow by 2 orders of magnitude during the HL-LHC era.
  – ~17 PB/year for DUNE.
• Analysis tasks, and therefore analysis software, are also complex.
• The analysis software is written by thousands of physicists.
• We provide a modular framework and data model to support their collaboration.
• We are exploring the design of an object model that leverages HPC machine features.

[FIG. 1 diagram, traditional solution: simulation output and reconstruction output → particle and event ID → event selection and data reduction → neutrino candidates and events (background, predictions, observations) → model (neutrino and anti-neutrino) → parameter estimation → n-dimensional contour plots. Annotations: the data store is tens of thousands of CAF files in dCache or on tape (data size ~TBs); selection is a ROOT macro run as tens of thousands of individual jobs on the grid; output data size is a few tens of KBs.]

• Then summarize these tens of events to be used in the extraction of the neutrino oscillation parameters.

• Goal: reduce the time it takes to process analysis-level data.

Measurement of the neutrino oscillation parameters by the NO𝜈A collaboration, PRL 118, 231801 (2017).
• Sample of ~27 million reconstructed spills to search for electron-neutrino appearance events.

[FIG. 4 diagram, HPC solution: compute nodes (CN) connected by the Aries high-speed network, high-bandwidth burst buffers (BB), a login node, and a high-bandwidth global file system holding the striped novacaf.h5 file. Job launch:]

srun -n 76800 shifter \
    python process_nova.py novacaf.h5

Most of the computing resources that will be available to us will be at HPC centers.
• Modify our workflows and code to take full advantage of HPC systems.
• Minimize reading, communication, and synchronization between processes; parallelism is implicit, and user-written code looks just like serial code.
• Process all data for a given slice in a single MPI rank; the slice is NOvA's "atomic" unit of processing, like a collider event.
• Organize the data into a single HDF5 file containing many different tables (a minimal sketch of this layout follows).
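To make the layout concrete, here is a minimal sketch (not the production code: the file name is hypothetical, and the group and column names are copied from the tables in FIG. 3, so the real novacaf.h5 schema may differ). Each table is an HDF5 group of equal-length column datasets, and every table repeats the run/subRun/event/slice index columns so rows can be matched by slice.

import h5py
import numpy as np

# Sketch of "one file, many tables": each table is a group whose members are
# equal-length 1-D column datasets; every table repeats the index columns.
with h5py.File('novacaf_sketch.h5', 'w') as f:             # hypothetical file name
    sel = f.create_group('sel_nuecosrej')                  # one row per slice
    sel.create_dataset('run',    data=np.array([16433, 16433]))
    sel.create_dataset('subRun', data=np.array([61, 61]))
    sel.create_dataset('event',  data=np.array([356124, 356124]))
    sel.create_dataset('slice',  data=np.array([35, 36]))
    sel.create_dataset('distallpngtop', data=np.array([np.nan, -0.74013746]))

    vtx = f.create_group('vtx_elastic')                    # zero or more rows per slice
    for name, column in [('run', [16433]), ('subRun', [61]), ('event', [356124]),
                         ('slice', [36]), ('vtxid', [0]), ('npng3d', [1])]:
        vtx.create_dataset(name, data=np.array(column))

Reading a table back for the selection code is then just a matter of loading each column dataset into a pandas DataFrame column.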

[FIG. 2 diagram: ~20,000 art/ROOT files → ~20,000 art batch jobs on grid resources → ~20,000 HDF5 files → transfer files to NERSC using grid tools → 1 MPI job to concatenate the ~20,000 HDF5 files → 1 HDF5 file.]
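The concatenation step might look like the following serial sketch (illustrative only: the production step is a single MPI job, the input directory name is made up, and the column-per-dataset layout is the same assumption as in the sketch above).

import glob
import h5py
import numpy as np

inputs = sorted(glob.glob('caf_h5/*.h5'))       # the ~20,000 per-job HDF5 files
with h5py.File('novacaf.h5', 'w') as out:
    # Discover the table/column structure from the first input file,
    # assuming every top-level object is a table group of column datasets.
    with h5py.File(inputs[0], 'r') as first:
        layout = {table: list(first[table].keys()) for table in first.keys()}
    # Concatenate each column of each table across all inputs.
    # (Simple but slow: every file is reopened per column; the MPI job
    # instead spreads the files across ranks and writes in parallel.)
    for table, columns in layout.items():
        grp = out.create_group(table)
        for col in columns:
            parts = []
            for path in inputs:
                with h5py.File(path, 'r') as f:
                    parts.append(f[table][col][...])
            grp.create_dataset(col, data=np.concatenate(parts))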

Summary and Future work
• NOvA is taking ownership of our HDF "ntuple" production code.
• We will be doing large-scale performance testing of the code.
  – A similar design processing LArIAT data demonstrated perfect scaling to 76,800 ranks: it read and decompressed 42 TB of data in < 20 seconds of wall-clock time.
• We will be comparing performance with a DIY C++14 implementation.
• Integration with a larger workflow using Decaf, which is also part of the SciDAC project:
  – use of changes in event selection criteria to evaluate systematic uncertainties in the mixing-parameter measurements;
  – one integrated MPI program, to take best advantage of the HPC platform.

def kNueSecondAnaContainment(tables):
    df = tables['sel_nuecosrej']
    return (df.distallpngtop > 63.0) & \
           (df.distallpngbottom > 12.0) & \
           (df.distallpngeast > 12.0) & \
           (df.distallpngwest > 12.0) & \
           (df.distallpngfront > 18.0) & \
           (df.distallpngback > 18.0)

import numpy as np

def vtxelasticzCut(tables):
    df = tables['vtx_elastic']
    df['good'] = (df.vtxid == 0) & (df.npng3d > 0)
    KL = ['run', 'subRun', 'event', 'slice']
    return df.groupby(KL)['good'].agg(np.any)

Listing 1. Selection on multiple columns of a table

Listing 2. Groupby and Aggregation example

Distributing and reading information
• Each rank reads its "fair share" of index info from each table:
  – identifies which rank should handle which event, for the most even balance;
  – identifies the range of rows in each table that corresponds to each event (all slices).
• Event "ownership" information is distributed to all ranks:
  – this assures no further communication between ranks is needed while evaluating the selection criteria on a slice-by-slice basis;
  – perfect data parallelism in running all selection code (see the sketch after this list).
• Each rank reads only relevant rows of relevant columns from relevant tables:
  – all relevant data are read by some rank;
  – no rank reads the same data as another.
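A minimal mpi4py sketch of the first two steps (assumptions: the file name and index columns come from the figures, the table layout follows the earlier sketch, and a simple round-robin assignment stands in for whatever balancing the real code uses):

import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

with h5py.File('novacaf.h5', 'r') as f:
    # Each rank reads its "fair share" of the index columns of one table.
    nrows = f['sel_nuecosrej/event'].shape[0]
    lo, hi = rank * nrows // nranks, (rank + 1) * nrows // nranks
    run    = f['sel_nuecosrej/run'][lo:hi]
    subrun = f['sel_nuecosrej/subRun'][lo:hi]
    event  = f['sel_nuecosrej/event'][lo:hi]

# Every rank learns the full event list, then applies the same deterministic
# assignment, so no further communication is needed during the selection.
local_events = sorted(set(zip(run.tolist(), subrun.tolist(), event.tolist())))
all_events = sorted(set(e for part in comm.allgather(local_events) for e in part))
owner = {key: i % nranks for i, key in enumerate(all_events)}   # round-robin
my_events = [key for key, r in owner.items() if r == rank]

# Next (not shown): each rank reads only the rows of the relevant columns,
# from every table, that belong to the events it owns.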

[FIG. 3 diagram: the sel_nuecosrej and vtx_elastic tables, with shading showing index info read by rank 1 and table rows read by rank 0.]

Table 1: sel_nuecosrej
run    subRun  event   slice  distallpngtop  ...35 more...
16433  61      356124  35     nan
16433  61      356124  36     -0.74013746
16433  61      356124  37     nan
16433  61      356125  1      nan
16433  61      356125  2      423.6337
16433  61      356125  3      -2.849864

Table 1. This table has one entry per slice. We are doing slice-by-slice selection.

Table 2: vtx_elastic
run    subRun  event   slice  vtxid  npng3d  ...6 more...
16433  61      356124  35     0      0
16433  61      356124  36     0      1
16433  61      356124  36     1      1
16433  61      356124  36     2      5
16433  61      356125  1      0      1
16433  61      356125  3      0      0

Table 2. This table has one entry per vertex. Some slices have none (and do not appear in this table). Some slices have many vertices.
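For example, a hypothetical glue step (not shown on the poster) could align the two listings on the (run, subRun, event, slice) key and combine them: Listing 1 gives one Boolean per sel_nuecosrej row (one per slice), and Listing 2 aggregates vtx_elastic back to slice level.

import pandas as pd

KL = ['run', 'subRun', 'event', 'slice']

def selected_slices(tables):
    # Key Listing 1's per-row result by (run, subRun, event, slice).
    contained = kNueSecondAnaContainment(tables)                 # from Listing 1
    contained.index = pd.MultiIndex.from_frame(tables['sel_nuecosrej'][KL])
    # Listing 2 is already keyed by the same columns via groupby.
    good_vertex = vtxelasticzCut(tables)                         # from Listing 2
    # Slices with no vtx_elastic entry have no good vertex.
    good_vertex = good_vertex.reindex(contained.index, fill_value=False).astype(bool)
    keep = contained & good_vertex
    return keep[keep].index          # the slice keys that pass both cuts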

hepnos::DataStore datastore(configfile);
for (auto it = datastore.begin(); it != datastore.end(); ++it)
  for (auto const& ds : *it)
    for (auto const& r : ds.runs())
      for (auto const& sr : r)
        for (auto const& e : sr) {
          std::vector<recob::Hit> hits;
          art::InputTag hit_tag("gaushit", "", "PrimaryReco");
          e.load(hit_tag, hits);
          // .. do something with the vector of Hits ..
        }

Summary and Future work
• We are exploring how to put support for MPI parallelism directly into the infrastructure.
  – This will enable our reconstruction/analysis codes to take advantage of data parallelism and distributed programming without looking much different than they do today.
• We will deploy to NERSC (through Shifter) and ALCF (through Singularity) and perform a scalability study.
• We will be comparing performance against current methods of reading and writing data.
• Integrate the object store with the larger workflow utilizing Decaf and enable its use at different levels of processing.

Listing 3. Example of HEPnOS use: reading from the object store

The user interface looks a lot like our current user interface.

FIG. 2: Workflow for translating old-style data to new-style data

FIG. 3: An example of distributing and reading

Example of new-style data organization in HDF5

FIG. 4: HPC solution

FIG. 6: CMS detector

FIG. 5: DUNE

FIG. 8: HEPnOS

FIG. 9: class recob::Track

User code Examples

HEP data organization
● A common design is to have levels of data aggregation: data set / run / subrun / event / data products.
● At each level of aggregation, the named items are independent.
● Data sets can contain other data sets.
● Distribution of data is automated by the data model implementation, not user-visible (a minimal sketch of the nesting follows).
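A minimal sketch of that nesting, written as Python dataclasses purely for illustration (these are not the HEPnOS C++ types):

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Event:
    number: int
    products: Dict[str, Any] = field(default_factory=dict)   # data products by label

@dataclass
class SubRun:
    number: int
    events: List[Event] = field(default_factory=list)

@dataclass
class Run:
    number: int
    subruns: List[SubRun] = field(default_factory=list)

@dataclass
class DataSet:
    name: str
    runs: List[Run] = field(default_factory=list)
    datasets: List["DataSet"] = field(default_factory=list)  # data sets can nest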

[FIG. 7 diagram: art modules/user code (Algo 1, Algo 2, Algo 3) call get<product>(key) on an Event "Proxy"; the Source prepares the "correct" proxy, which loads data through the HEPnOS C++ API. The user path is the interaction with the proxy; the proxy handles the interaction with HEPnOS.]

FIG. 7: User interaction with HEPnOS
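As a rough illustration of the proxy idea in FIG. 7, sketched in Python (the real interface is the HEPnOS C++ API, and every name below is made up): user code asks the event proxy for a product by key, and the proxy loads it through the backing store only when first requested.

from typing import Any, Callable, Dict

class EventProxy:
    """Hands products to user code, loading each one lazily from a backing
    store the first time it is requested (sketch of the FIG. 7 pattern)."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader              # e.g. a call into the object store
        self._cache: Dict[str, Any] = {}

    def get(self, key: str) -> Any:        # user path: interaction with the proxy
        if key not in self._cache:
            self._cache[key] = self._loader(key)   # interaction with the store
        return self._cache[key]

# The source would prepare the "correct" proxy for each event, e.g.
#   event = EventProxy(loader=lambda key: load_from_store(event_id, key))
#   hits = event.get('gaushit')            # hypothetical product key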

[FIG. 8 diagram, HEPnOS software stack: HEP code uses a C++ API layered over BAKE and SDS-KeyVal, communicating via RPC and RDMA, with PMEM and LevelDB as backing stores.]

FERMILAB-POSTER-18-097-CD

Acknowledgement: This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, grant 1013935.

FIG. 10: class recob::Hit

Goals
● Manage physics event data from simulation and experiment through multiple phases of analysis
● Accelerate access by retaining data in the system throughout the analysis process
● Reuses components from the Mochi ASCR R&D project

Properties
● Write-once, read-many
● Hierarchical namespace (datasets, runs, subruns)
● C++ API (serialization of C++ objects)

Components
● Mercury, Argobots, Margo, SDSKV, BAKE, SSG
● New code: C++ event interface

● Map data model into stores

Demo
• Using existing art framework-related software:
  – gallery provides access to event data in art/ROOT files from outside the art framework;
  – pre-existing builds of the LArSoft data product libraries.
• Using some of the data product classes from LArSoft (used by DUNE, etc.):
  – std::vector<recob::Hit>
  – std::vector<recob::Track>
  – art::Assns<recob::Hit, recob::Track>, which are bidirectional associations between hits and tracks.

• Using Docker containers for easy portability of the development environment.
• Prototype test programs are already running to:
  – read from existing art/ROOT data files, and to write to the new data store;
  – read from the new data store, and verify the integrity of the data.