


Using Object store for HEP data: HEPnOS

HEP Data management

HPC and data management for HEP
SciDAC HEP Data Analytics on HPC

N. Buchanan^2, S. Calvez^2, P. Carns^1, P.F. Ding^4, M. Dorier^1, D. Doyle^2, A. Himmel^4, J. Kowalkowski^4, R. Latham^1, M. Paterno^4, A. Norman^4, S. Sehrish^4, A. Sousa^3, S. Snyder^1, R. Ross^1

^1 Argonne National Laboratory, ^2 Colorado State University, ^3 University of Cincinnati, ^4 Fermi National Accelerator Laboratory

NO𝜈A Event Selection

• Events are stored in a ROOT n-tuple format.
• ~180 thousand ROOT files.
• File sizes range from a few hundred KiB to a few MiB; the full dataset is ~3 TiB.
• Strict event selection criteria reduce millions of these events to a few tens.

Using HPC for Parallel Event Selection

FIG. 1: Science context and traditional solution

2018 Scientific Discoverythrough Advanced Computing (SciDAC-4)

Principal Investigator Meeting, Rockville, MD
http://computing.fnal.gov/hep-on-hpc/

• Experimental HEP deals with complex Big Data from unique detectors.
  – Already ~10s of PB for CMS at US centers, expected to grow by 2 orders of magnitude during the HL-LHC era.
  – ~17 PB/year for DUNE.
• Analysis tasks, and therefore analysis software, are also complex.
• The analysis software is written by thousands of physicists.
• We provide a modular framework and data model to support their collaboration.
• We are exploring the design of an object model that leverages HPC machine features.

[FIG. 1 diagram, traditional solution: simulation output and reconstruction output → particle and event ID → event selection and data reduction → neutrino candidates and events (background, predictions, observations) → model (neutrino and anti-neutrino) → parameter estimation → n-dimensional contour plots. Annotations: the data store is tens of thousands of CAF files in dCache or on tape (data size ~TBs); selection is a ROOT macro run as tens of thousands of individual jobs on the grid; output data size is a few tens of KBs.]

• Then summarize these tens of events to be used in the extraction of the neutrino oscillation parameters.

• Goal: reduce the time it takes to process analysis-level data.

Measurement of the neutrino oscillation parameters by the NO𝜈A collaboration, PRL 118, 231801 (2017).
• Sample of ~27 million reconstructed spills to search for electron-neutrino appearance events.

[FIG. 4 diagram, HPC solution: compute nodes (CN) connected by the Aries high-speed network, high-bandwidth burst buffers (BB), a login node, and a high-bandwidth global file system holding the striped novacaf.h5 file. Job launch:]

srun -n 76800 shifter \
    python process_nova.py novacaf.h5

Most of the computing resources that will be available to us will be at HPC centers.
• Modify our workflows and code to take full advantage of HPC systems.
• Minimize reading, communication, and synchronization between processes; parallelism is implicit, and user-written code looks just like serial code.
• Process all data for a given slice in a single MPI rank; the slice is NOvA's "atomic" unit of processing, like a collider event.
• Organize the data into a single HDF5 file containing many different tables (a minimal sketch of this layout follows).
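To make the layout concrete, here is a minimal sketch (not the production code: the file name is hypothetical, and the group and column names are copied from the tables in FIG. 3, so the real novacaf.h5 schema may differ). Each table is an HDF5 group of equal-length column datasets, and every table repeats the run/subRun/event/slice index columns so rows can be matched by slice.

import h5py
import numpy as np

# Sketch of "one file, many tables": each table is a group whose members are
# equal-length 1-D column datasets; every table repeats the index columns.
with h5py.File('novacaf_sketch.h5', 'w') as f:             # hypothetical file name
    sel = f.create_group('sel_nuecosrej')                  # one row per slice
    sel.create_dataset('run',    data=np.array([16433, 16433]))
    sel.create_dataset('subRun', data=np.array([61, 61]))
    sel.create_dataset('event',  data=np.array([356124, 356124]))
    sel.create_dataset('slice',  data=np.array([35, 36]))
    sel.create_dataset('distallpngtop', data=np.array([np.nan, -0.74013746]))

    vtx = f.create_group('vtx_elastic')                    # zero or more rows per slice
    for name, column in [('run', [16433]), ('subRun', [61]), ('event', [356124]),
                         ('slice', [36]), ('vtxid', [0]), ('npng3d', [1])]:
        vtx.create_dataset(name, data=np.array(column))

Reading a table back for the selection code is then just a matter of loading each column dataset into a pandas DataFrame column.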

[FIG. 2 diagram: ~20,000 art/ROOT files → ~20,000 art batch jobs on grid resources → ~20,000 HDF5 files → transfer files to NERSC using grid tools → 1 MPI job to concatenate the ~20,000 HDF5 files → 1 HDF5 file.]
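The concatenation step might look like the following serial sketch (illustrative only: the production step is a single MPI job, the input directory name is made up, and the column-per-dataset layout is the same assumption as in the sketch above).

import glob
import h5py
import numpy as np

inputs = sorted(glob.glob('caf_h5/*.h5'))       # the ~20,000 per-job HDF5 files
with h5py.File('novacaf.h5', 'w') as out:
    # Discover the table/column structure from the first input file,
    # assuming every top-level object is a table group of column datasets.
    with h5py.File(inputs[0], 'r') as first:
        layout = {table: list(first[table].keys()) for table in first.keys()}
    # Concatenate each column of each table across all inputs.
    # (Simple but slow: every file is reopened per column; the MPI job
    # instead spreads the files across ranks and writes in parallel.)
    for table, columns in layout.items():
        grp = out.create_group(table)
        for col in columns:
            parts = []
            for path in inputs:
                with h5py.File(path, 'r') as f:
                    parts.append(f[table][col][...])
            grp.create_dataset(col, data=np.concatenate(parts))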

Summary and Future work
• NOvA is taking ownership of our HDF "ntuple" production code.
• We will be doing large-scale performance testing of the code.
  – A similar design processing LArIAT data demonstrated perfect scaling to 76,800 ranks: it read and decompressed 42 TB of data in < 20 seconds of wall-clock time.
• We will be comparing performance with a DIY C++14 implementation.
• Integration with a larger workflow using Decaf, which is also part of the SciDAC project:
  – use of changes in event selection criteria to evaluate systematic uncertainties in the mixing-parameter measurements;
  – one integrated MPI program, to take best advantage of the HPC platform.

def kNueSecondAnaContainment(tables):
    df = tables['sel_nuecosrej']
    return (df.distallpngtop > 63.0) & \
           (df.distallpngbottom > 12.0) & \
           (df.distallpngeast > 12.0) & \
           (df.distallpngwest > 12.0) & \
           (df.distallpngfront > 18.0) & \
           (df.distallpngback > 18.0)

import numpy as np

def vtxelasticzCut(tables):
    df = tables['vtx_elastic']
    df['good'] = (df.vtxid == 0) & (df.npng3d > 0)
    KL = ['run', 'subRun', 'event', 'slice']
    return df.groupby(KL)['good'].agg(np.any)

Listing 1. Selection on multiple columns of a table

Listing 2. Groupby and Aggregation example

Distributing and reading information
• Each rank reads its "fair share" of index info from each table:
  – identifies which rank should handle which event, for the most even balance;
  – identifies the range of rows in each table that corresponds to each event (all slices).
• Event "ownership" information is distributed to all ranks:
  – this assures no further communication between ranks is needed while evaluating the selection criteria on a slice-by-slice basis;
  – perfect data parallelism in running all selection code (see the sketch after this list).
• Each rank reads only relevant rows of relevant columns from relevant tables:
  – all relevant data are read by some rank;
  – no rank reads the same data as another.
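A minimal mpi4py sketch of the first two steps (assumptions: the file name and index columns come from the figures, the table layout follows the earlier sketch, and a simple round-robin assignment stands in for whatever balancing the real code uses):

import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

with h5py.File('novacaf.h5', 'r') as f:
    # Each rank reads its "fair share" of the index columns of one table.
    nrows = f['sel_nuecosrej/event'].shape[0]
    lo, hi = rank * nrows // nranks, (rank + 1) * nrows // nranks
    run    = f['sel_nuecosrej/run'][lo:hi]
    subrun = f['sel_nuecosrej/subRun'][lo:hi]
    event  = f['sel_nuecosrej/event'][lo:hi]

# Every rank learns the full event list, then applies the same deterministic
# assignment, so no further communication is needed during the selection.
local_events = sorted(set(zip(run.tolist(), subrun.tolist(), event.tolist())))
all_events = sorted(set(e for part in comm.allgather(local_events) for e in part))
owner = {key: i % nranks for i, key in enumerate(all_events)}   # round-robin
my_events = [key for key, r in owner.items() if r == rank]

# Next (not shown): each rank reads only the rows of the relevant columns,
# from every table, that belong to the events it owns.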

[FIG. 3 diagram: the sel_nuecosrej and vtx_elastic tables, with shading showing index info read by rank 1 and table rows read by rank 0.]

Table 1: sel_nuecosrej
run    subRun  event   slice  distallpngtop  ...35 more...
16433  61      356124  35     nan
16433  61      356124  36     -0.74013746
16433  61      356124  37     nan
16433  61      356125  1      nan
16433  61      356125  2      423.6337
16433  61      356125  3      -2.849864

Table 1. This table has one entry per slice. We are doing slice-by-slice selection.

Table 2: vtx_elastic
run    subRun  event   slice  vtxid  npng3d  ...6 more...
16433  61      356124  35     0      0
16433  61      356124  36     0      1
16433  61      356124  36     1      1
16433  61      356124  36     2      5
16433  61      356125  1      0      1
16433  61      356125  3      0      0

Table 2. This table has one entry per vertex. Some slices have none (and do not appear in this table). Some slices have many vertices.
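For example, a hypothetical glue step (not shown on the poster) could align the two listings on the (run, subRun, event, slice) key and combine them: Listing 1 gives one Boolean per sel_nuecosrej row (one per slice), and Listing 2 aggregates vtx_elastic back to slice level.

import pandas as pd

KL = ['run', 'subRun', 'event', 'slice']

def selected_slices(tables):
    # Key Listing 1's per-row result by (run, subRun, event, slice).
    contained = kNueSecondAnaContainment(tables)                 # from Listing 1
    contained.index = pd.MultiIndex.from_frame(tables['sel_nuecosrej'][KL])
    # Listing 2 is already keyed by the same columns via groupby.
    good_vertex = vtxelasticzCut(tables)                         # from Listing 2
    # Slices with no vtx_elastic entry have no good vertex.
    good_vertex = good_vertex.reindex(contained.index, fill_value=False).astype(bool)
    keep = contained & good_vertex
    return keep[keep].index          # the slice keys that pass both cuts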

hepnos::DataStore datastore(configfile);
for (auto it = datastore.begin(); it != datastore.end(); ++it)
  for (auto const& ds : *it)
    for (auto const& r : ds.runs())
      for (auto const& sr : r)
        for (auto const& e : sr) {
          std::vector<recob::Hit> hits;
          art::InputTag hit_tag("gaushit", "", "PrimaryReco");
          e.load(hit_tag, hits);
          // .. do something with the vector of Hits ..
        }

Summary and Future work
• We are exploring how to put support for MPI parallelism directly into the infrastructure.
  – This will enable our reconstruction/analysis codes to take advantage of data parallelism and distributed programming without looking much different than they do today.
• We will deploy to NERSC (through Shifter) and ALCF (through Singularity) and perform a scalability study.
• We will be comparing performance against current methods of reading and writing data.
• Integrate the object store with the larger workflow utilizing Decaf and enable its use at different levels of processing.

Listing 3. Example of HEPnOS use: reading from the object store

The user interface looks a lot like our current user interface.

FIG. 2: Workflow for translating old-style data to new-style data

FIG. 3: An example of distributing and reading

Example of new-style data organization in HDF5

FIG. 4: HPC solution

FIG. 6: CMS detector

FIG. 5: DUNE

FIG. 8: HEPnOS

FIG. 9: class recob::Track

User code Examples

HEP data organization
● A common design is to have levels of data aggregation: data set / run / subrun / event / data products.
● At each level of aggregation, the named items are independent.
● Data sets can contain other data sets.
● Distribution of data is automated by the data model implementation, not user-visible (a minimal sketch of the nesting follows).
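A minimal sketch of that nesting, written as Python dataclasses purely for illustration (these are not the HEPnOS C++ types):

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Event:
    number: int
    products: Dict[str, Any] = field(default_factory=dict)   # data products by label

@dataclass
class SubRun:
    number: int
    events: List[Event] = field(default_factory=list)

@dataclass
class Run:
    number: int
    subruns: List[SubRun] = field(default_factory=list)

@dataclass
class DataSet:
    name: str
    runs: List[Run] = field(default_factory=list)
    datasets: List["DataSet"] = field(default_factory=list)  # data sets can nest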

[FIG. 7 diagram: art modules/user code (Algo 1, Algo 2, Algo 3) call get<product>(key) on an Event "Proxy"; the Source prepares the "correct" proxy, which loads data through the HEPnOS C++ API. The user path is the interaction with the proxy; the proxy handles the interaction with HEPnOS.]

FIG. 7: User interaction with HEPnOS
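As a rough illustration of the proxy idea in FIG. 7, sketched in Python (the real interface is the HEPnOS C++ API, and every name below is made up): user code asks the event proxy for a product by key, and the proxy loads it through the backing store only when first requested.

from typing import Any, Callable, Dict

class EventProxy:
    """Hands products to user code, loading each one lazily from a backing
    store the first time it is requested (sketch of the FIG. 7 pattern)."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader              # e.g. a call into the object store
        self._cache: Dict[str, Any] = {}

    def get(self, key: str) -> Any:        # user path: interaction with the proxy
        if key not in self._cache:
            self._cache[key] = self._loader(key)   # interaction with the store
        return self._cache[key]

# The source would prepare the "correct" proxy for each event, e.g.
#   event = EventProxy(loader=lambda key: load_from_store(event_id, key))
#   hits = event.get('gaushit')            # hypothetical product key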

[FIG. 8 diagram, HEPnOS software stack: HEP code uses a C++ API layered over BAKE and SDS-KeyVal, communicating via RPC and RDMA, with PMEM and LevelDB as backing stores.]

FERMILAB-POSTER-18-097-CD

Acknowledgement: This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, grant 1013935.

FIG. 10: class recob::Hit

Goals
● Manage physics event data from simulation and experiment through multiple phases of analysis
● Accelerate access by retaining data in the system throughout the analysis process
● Reuses components from the Mochi ASCR R&D project

Properties
● Write-once, read-many
● Hierarchical namespace (datasets, runs, subruns)
● C++ API (serialization of C++ objects)

Components
● Mercury, Argobots, Margo, SDSKV, BAKE, SSG
● New code: C++ event interface

● Map data model into stores

Demo
• Using existing art framework-related software:
  – gallery provides access to event data in art/ROOT files from outside the art framework;
  – pre-existing builds of the LArSoft data product libraries.
• Using some of the data product classes from LArSoft (used by DUNE, etc.):
  – std::vector<recob::Hit>
  – std::vector<recob::Track>
  – art::Assns<recob::Hit, recob::Track>, which are bidirectional associations between hits and tracks.

• Using Docker containers for easy portability of the development environment.
• Prototype test programs are already running to:
  – read from existing art/ROOT data files, and to write to the new data store;
  – read from the new data store, and verify the integrity of the data.