massive simulations in spark: distributed monte carlo for global health forecasts

38
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts Kyle Foreman, PhD 07 June 2016

Upload: jen-aman

Post on 16-Apr-2017

1.049 views

Category:

Data & Analytics


1 download

TRANSCRIPT

Page 1: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Massive Simulations In Spark: Distributed Monte Carlo For Global Health ForecastsKyle Foreman, PhD07 June 2016

Page 2: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

2

Page 3: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

What is IHME?

3

• Independent research center at the University of Washington

• Core funding by Bill & Melinda Gates Foundation and State of Washington

• ~300 faculty, researchers, students and staff

• Providing rigorous, scientific measuremento What are the world’s major health problems? o How well is society addressing these problems? o How should we best dedicate resources to

improving health?

Goal: improve health

by providing the best information

on population health

Page 4: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Global Burden of Disease• A systematic scientific effort to

quantify the comparative magnitude

of health loss due to diseases,

injuries and risk factors

o 188 countries

o 300+ diseases

o 1990 – 2015+

o 50+ risk factors

4

Page 5: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

GBD Collaborative Model

1,100 experts in 106 countries

Page 6: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Death Rate in 2013 (age-standardized)

6

Page 7: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Childhood Death Rate in 2013

7

Page 8: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

DALYs in High Income Countries (2010)

8

Page 9: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

DALYs in Low & Middle Income Countries (2010)

9

Page 10: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

10

Page 11: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Forecasting the Burden of Disease

11

1. Generate a baseline scenario projecting the GBD 25 years into the futureo Mortality, morbidity, population, and risk factorso Every country, cause, sex, age

2. Create a comprehensive simulation framework to assess alternative scenarioso Capture the complex interrelationships between risk factors, interventions, and

diseases to explore the effects of changes to the system

3. Build a modular and flexible platform that can incorporate new models to answer detailed “what if?” questionso E.g. introduction of a new vaccine, effects of global warming, scale up of ART

coverage, risk of global pandemics, etc

Page 12: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations

• Takes into account interdependencies between risk factors, diseases, etc.o Use draws of model parameters to advance from t to t+1, allowing for

forecasts of other quantities (e.g. risk factors and mortality) to interact• Modular structure allows for detailed “what ifs”

o E.g. scale up of coverage of a new intervention• Allows us to incorporate alternative models to test sensitivity to model

choice and specification

12

Page 13: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Basic Simulation Process

1. Fit statistical models (PyMB) capturing most important relationships as statistical parameters

2. Generate many draws from those parameters, reflecting their uncertainty and correlation structure

3. Run Monte Carlo simulations to update each quantity over time, taking into account its dependencies

13

Page 14: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

14

Page 15: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

SimBuilder

• Directed Acyclic Graph (DAG) construction and validationo yaml and sympy specificationo curses interfaceo graphviz visualization

• Flexible execution backendso pandas

o pyspark

15

Page 16: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Example Global Health DAG

16

GDP 2016

Deaths 2016

Births 2016

Population 2016

Workforce 2016

Deaths 2017

Births 2017

Population 2017

Workforce 2017

GDP 2017

Page 17: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Creating a DAG with SimBuilder

1. Create a YAML for the overall simulationo Specify child models and the dimensions of the simulation

2. Build separate YAML files for each child modelo Dimensions along which the model varieso Location of historical datao Expression for how to generate the quantity in t+1

3. Run SimBuilder to validate and explore DAG

17

Page 18: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Creating a DAG with SimBuilder

18

name: gdp_pop models: - name: gdp_per_capita - name: workforce - name: mortality - name: gdp - name: pop global_parameters: - name: loc set: List values: ["USA", "MEX", "CAN", ..., ] - name: t set: Range start: 2014 end: 2040 - name: draw set: Range start: 0 end: 999

sim.yaml

Page 19: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

19

Creating a DAG with SimBuildername: pop version: 0.1 variables: [loc,t] history: - type: csv path: pop.csv

pop.yaml

Page 20: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Creating a DAG with SimBuildername: gdp_per_capitaversion: 0.1expr: gdp(draw, t, loc)/pop(draw, t-1, loc)variables: [draw, t, loc]history: - type: csv path: gdp_pc_w_draw.csv

gdp_per_capita.yaml

20

Page 21: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

21

Creating a DAG with SimBuildername: gdp version: 0.1 expr: gdp(d,loc,t-1) * exp( alpha_0(loc, t, draw) + beta_gdp(draw) * log(gdp_per_capita(draw, loc, t-1)) + beta_pop(draw, t) * workforce(loc, t) ) variables: [draw ,loc,t] history: - type: csv path: gdp_w_draw.csv model_parameters: - name: alpha_0 variables: [draw, loc, t] history: - type: csv path: gdp/alpha_0.csv - name: beta_gdp variables: [d] history: - type: csv path: gdp/beta_gdp.csv - name: beta_pop variables: [d] history: - type: csv path: gdp/beta_pop.csv

gdp.yaml

Page 22: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

DAG

22

gdp{0}

gdp_per_capita{0}

workforce{0}

pop{0}

mortality{0}

gdp_beta_gdp gdp_alpha_0{0} gdp_beta_pop{0}

mortality_beta_pop{0} mortality_alpha_0{0} mortality_beta_gdp

Page 23: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

23

gdp{0}

gdp{1}

gdp_per_capita{0}

gdp_per_capita{1}

workforce{0}

mortality{1}

workforce{1}

pop{0}

pop{1}

mortality{0}

gdp_beta_gdpgdp_alpha_0{0}gdp_beta_pop{0}

gdp_alpha_0{1} gdp_beta_pop{1}

mortality_beta_pop{0} mortality_alpha_0{0} mortality_beta_gdp

mortality_beta_pop{1} mortality_alpha_0{1}

Page 24: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

24

gdp{0}

gdp{1}

gdp_per_capita{0}

gdp{2}

gdp_per_capita{1}

gdp{3}

gdp_per_capita{2}

gdp{4}

gdp_per_capita{3}

gdp_per_capita{4}

workforce{0}

mortality{1}

workforce{1}

mortality{2}

workforce{2}

mortality{3}

workforce{3}

mortality{4}

workforce{4}

pop{0}

pop{1}

pop{2}

pop{3}

pop{4}

mortality{0}

gdp_beta_gdpgdp_alpha_0{0} gdp_beta_pop{0}

gdp_alpha_0{1} gdp_beta_pop{1}

gdp_alpha_0{2}gdp_beta_pop{2}

gdp_alpha_0{3} gdp_beta_pop{3}

gdp_alpha_0{4}gdp_beta_pop{4}

mortality_beta_pop{0}mortality_alpha_0{0}mortality_beta_gdp

mortality_beta_pop{1}mortality_alpha_0{1}

mortality_beta_pop{2} mortality_alpha_0{2}

mortality_beta_pop{3} mortality_alpha_0{3}

mortality_beta_pop{4} mortality_alpha_0{4}

Page 25: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

25

Page 26: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

SimBuilder Backends

• SimBuilder traverses the DAGoBackends plugin via API

• Backends provideoData loading strategyoMethods to combine data from different nodes to

calculate new nodesoMethod to save computed data

26

Page 27: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Local Pandas Backend

• Useful for debugging and small simulations• Master process maintains list of Model objects:

o pandas DataFrames of all model data─ Indexed by global variables

o sympy expression for how to calculate t+1 ─ Vectorized when possible, fallback to row-wise apply

• Simulating over time traverses DAG to join necessary DataFrames and compute results to fill in new columns

27

Page 28: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Spark DataFrame Backend

• Similar to local backend, swapping in Spark’s DataFramefor Pandas’

• Spark context maintains list of Model objects:o pyspark DataFrames of all model data

─ Columns for each global variable and o sympy expression for how to calculate t+1

─ Calculated using row-wise apply

• Joins DataFrames where necessary

28

Page 29: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Spark + Numpy Backend

• Uses Spark to distribute and schedule the execution of vectorized numpy operations

• One RDD for each node and year:o Containing an ndarray and index vector for each axiso Can optionally partition the data array into key-value pairs

along one or more axes (similar to bolt)• To calculate a new node, all the parent RDD’s are

unioned and then reduced into a new child RDD

29

Page 30: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

30

Page 31: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Benchmarking

• Single executor (8 cores, 32GB)• Synthetic data

o 65 nodes in DAGo 3 countries x 20 ages x 2 sexes x N drawso Forecasted forward 25 years

31

Page 32: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Benchmarks

32

Page 33: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Benchmarks

33

Page 34: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Simulations in Spark

• Context• Motivation• SimBuilder• Backends• Benchmarks• Discussion

34

Page 35: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Limitations of Spark DataFrame

• Majority of execution time is spent joining the DataFrameand aligning the dimension columns

• These times were achieved after careful tuning of partitionso Perhaps custom partitioners could reduce runtime further

35

Page 36: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Taking Advantage of Numpy

• For multidimensional datasets of consistent shape and size, simply aligning indices can eliminate the overhead of joinso Including generalizations to axes of size 1 to simplify coding

• Numpy’s vectorized operations are highly tuned and scale wello Including multithreading when e.g. compiled against MKL

36

Page 37: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Future Development

• Experiment with better partitioning of Numpy RDDso Perhaps use bolt?

• Improve partial graph executiono Executing the entire DAG at once often crashes the drivero Executing each year’s partial DAG in sequence results in idle CPU

time towards the end of that DAG

• Investigate more efficient Spark DataFrame joining methods for Panel data

37

Page 38: Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

Team

Kyle [email protected]

Chun-Wei [email protected]

Kyle [email protected]

healthdata.org