
1

Parallel computing on the grid: Experiences from computational finance

Ian STOKES-REES

INRIA Sophia-Antipolis, France

2

Outline

Reminder: Grid Vision

Grid Computing Strategies

Parallel Application Development on the Grid

ProActive

PicsouGrid Project

3

Reminder: Grid Vision

Federated
Large scale
Heterogeneous
Collaborative
Dynamic
Globally distributed

4

Strategy 1: Infrastructure-level Grid

Fire-and-forget non-interactive “tasks”
    Queued individually
    Run individually
    Results collected and collated at a later date
Example problems: particle physics computing (reconstruction, Monte Carlo simulation)
Example systems: EGEE/WLCG, NGS, TeraGrid, OSG, Grid5000
Users: have traditional computing tasks, with no “grid” in them, and just need CPUs to run them on

5

Strategy 2: Application-level Grid

Builds on infrastructure grid resources
Provides complete application
Grid interface built into application
    Or application is only way to access underlying Grid
Example applications: eDiaMoND, MyGrid, UNICORE
Users: specific to the application, but are “end users”. Typically don’t expect to “download and install” software. Rather, use “grid application” designed for their specific needs.

6

Strategy 3: Library-level grid

Single-system application linked-in with grid library; (semi-)transparently handles application deployment and execution across grid resources
Developers look after “grid” issues either directly or via library APIs/functionality
Varying levels of transparency in current offerings
Example libraries: Globus, OMII services, gLite (perhaps not yet), ProActive, MPICH-G2 (Globus MPI), GridMPI
Users: software developers who want to leverage grid computing in their applications

7

Parallel Application Development on the Grid

Big grid resources are out there (WLCG, NGS, OSG, Grid5000)
    Managed by other people (great!)
    Not always possible to install individually on each system and monitor/tweak operation (not so great!)
Remember “Grid Vision”:
    Heterogeneous
    Dynamic
    Federated
How to develop parallel algorithms/applications for distributed, heterogeneous systems?

8

Parallel Application Development on the Grid (II)

Synchronisation is difficult (obviously)
Distributed logging is difficult
Distributed debugging is really difficult
Requires a slightly different development paradigm:
    Granularity of computation needs to be more coarse
    Asynchrony is important to avoid blocking
    Simplicity is important to aid debugging and reduce sources of error

9

ProActive: Value Proposition

Java VM to reduce/eliminate hardware and software heterogeneity
    Forces use of Java everywhere
    Doesn’t hide performance differences!
Benefit from reflection and dynamic class loading
    Wrap objects in “gridified” sub-class
Provide asynchrony and multi-threading through “Active Objects”
    Futures
    Wait-by-necessity
Other features are auto-magically added onto wrapped classes, either by developers or at run-time via the Active Object factory.

10

Active Objects

Deterministic, multi-threaded, distributed inter-object communication, without a priori knowledge of object deployment.

Sequential Multithreaded Distributed

11

Futures and Wait-by-necessity

Method calls on Active Objects:
    Asynchronous
    Implicit futures as RMI result
Wait-by-necessity:
    Automatic wait upon the use of an implicit future
First-class futures:
    Futures passed to other activities
    Sending a future is not blocking
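The behaviour described on this slide can be approximated with the standard Java library. The sketch below is only an analogy and does not use the ProActive API: `CompletableFuture` makes the future explicit and the wait happens at an explicit `join()`, whereas ProActive futures are implicit and the wait is triggered automatically on first use. The class name and the `slowSquare` helper are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

class WaitByNecessitySketch {
    // Hypothetical stand-in for a slow method on a remote object.
    static int slowSquare(int x) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return x * x;
    }

    // Accepts a future that may still be pending and returns a new future
    // without blocking, mirroring "sending a future is not blocking".
    static CompletableFuture<Integer> plusOne(CompletableFuture<Integer> pending) {
        return pending.thenApply(v -> v + 1);
    }

    // The asynchronous call returns at once; the caller only waits at join(),
    // the point where the value is actually needed.
    static int run(int x) {
        CompletableFuture<Integer> result =
                CompletableFuture.supplyAsync(() -> slowSquare(x));
        CompletableFuture<Integer> derived = plusOne(result);
        return derived.join();
    }

    public static void main(String[] args) {
        System.out.println(run(7));
    }
}
```

Passing `result` to `plusOne` before it has completed is the standard-library counterpart of a first-class future handed to another activity.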

12

Creating Active Objects

MyClass obj_norm = new MyClass(<params>);

MyClass obj_act = newActive("MyClass", <params>);

Result r1 = obj_norm.foo(param);

Result r2 = obj_act.foo(param);

Result r3 = obj_act.bar(param);

//...

r2.bar(); //Wait-By-Necessity

In other words, the developer needs very little effort to introduce distributed multi-threading into an application.

13

Active Object Internals

An active object is composed of several objects:
1. The object being activated
2. A single thread
3. The queue of pending requests
4. A set of standard Java objects

[Figure: active object structure; labels: Proxy, Body, Object, Active object, Standard object]
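To make the composition concrete, here is a minimal sketch of the same idea using only `java.util.concurrent`: a plain object, one thread that serves it, and a FIFO queue of pending requests. It is not how ProActive is implemented; the `Counter` and `ActiveCounter` classes are hypothetical, and a single-thread executor stands in for the body with its thread and request queue.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// The plain object being activated (hypothetical example class).
class Counter {
    private int value;

    int addAndGet(int delta) {
        value += delta;
        return value;
    }
}

class ActiveCounter {
    private final Counter target = new Counter();
    // One thread plus an internal FIFO queue of pending requests.
    private final ExecutorService body = Executors.newSingleThreadExecutor();

    // Proxy-style method: enqueue the request and return a future at once.
    Future<Integer> addAndGet(int delta) {
        Callable<Integer> request = () -> target.addAndGet(delta);
        return body.submit(request);
    }

    void shutdown() {
        body.shutdown();
    }

    // Sends 1..calls to the active counter and waits for the last reply.
    static int demo(int calls) {
        ActiveCounter counter = new ActiveCounter();
        try {
            Future<Integer> last = null;
            for (int i = 1; i <= calls; i++) {
                last = counter.addAndGet(i);
            }
            return last == null ? 0 : last.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            counter.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(100));
    }
}
```

Because only the single body thread ever touches `Counter`, the wrapped class needs no locking of its own, which is the main appeal of the design.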

14

And lots of other nice features…

P2P interface
File sharing/distribution
Security
Typed group communication (OOSPMD)
Graphical distributed monitoring/debugging
    IC2D application
Timing and performance API (TimIT)
Object migration
Load balancing
Fault tolerance/check-pointing
Run-time deployment configuration (clusters/nodes)
Component model
Plus lots of docs, APIs, tutorials, examples, etc.

15

Check it out

Web: http://proactive.objectweb.org
Email: [email protected]
Or, ask me more about it at the pub
BTW, group has spin-off company coming this summer…

16

PicsouGrid: Computational Finance on the Grid

What?
    Option pricing
Why?
    Surprisingly, not done much in a grid domain
    Not many openly available implementations (parallel or not)
How?
    ProActive (i.e. Java) on Grid5000 and other grids (WLCG, NGS, DAS-3, …)

17

High Level Project Objectives

Framework for distributed computational finance algorithms

Investigate grid component model http://gridcomp.ercim.org/

Implement open source versions of parallel algorithms for computational finance

Utilise ProActive grid middleware
Deploy and evaluate on various grid platforms:
    Grid5000 (France)
    DAS3 (Netherlands)
    EGEE (Europe)

18

Grid Emphasis

This presentation and the subsequent paper focus on developing an architecture for parallel grid computing with:
    Multi-site (5+)
    Large scale (500-2000 cores)
    Long term (days to weeks)
    Multi-grid (2+)
Consequently, they de-emphasize computational finance-specific aspects (i.e. algorithms and application domain)
    However, other team members are working hard on this!

19

ProActive

http://www.objectweb.org/proactive
Java library for distributed computing
Developed by INRIA Sophia Antipolis, France (Project OASIS)
50-100 person-years R&D work invested
Provides transparent asynchronous distributed method calls
Implemented on top of Java RMI
Fully documented (600 page manual)
Available under LGPL
Used in commercial applications
Graphical debugger

20

ProActive (II)

OO SPMD with “Active Objects”
Any Java object can automatically be turned into an “Active Object”
    Utilises Java Reflection
“Wait by necessity” and “futures” allow method calls to return immediately; a subsequent access to the result blocks until it is ready
Objects appear local but may be deployed on any system within the ProActive environment (local system/cluster, or remote system, cluster, or grid)
Easy integration with existing systems
    Extensions seamlessly support various cluster, network, and grid environments: Globus, ssh, http(s), LSF, PBS, SGE, EGEE, Grid5000

21

Background – Options

Option trading: financial instruments which allow buyers to bet on future asset prices and sellers to reduce risk of owning asset

Call option: allows holder to purchase an asset at a fixed price in the future

Put option: allows holder to sell an asset at a fixed price in the future

Option pricing:
    European: fixed future exercise date
    American: can be exercised any time up to expiry date
    Basket: prices a set of options together
    Barrier: exercise depends on a certain barrier price being reached
Uses Monte Carlo simulations
    Possibility to aggregate statistical results

22

Background – PicsouGrid v1,2,3

Original versions of PicsouGrid utilised:
    Grid5000
    ProActive
    JavaSpaces
Implemented European Simple, Basket, and Barrier pricing
Medium-size distributed system: 4 sites, 180 nodes
Short operational runs (5-10 minutes)
Fault tolerance mechanisms
Achieved 90x speed-up with 140 systems
    65% efficiency
Reported in e-Science 2006 (Amsterdam, Nov 2006): “A Fault Tolerant and Multi-Paradigm Grid Architecture for Time Constrained Problems. Application to Option Pricing in Finance.”
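As a rough check on the speed-up and efficiency figures, assuming the usual definition of parallel efficiency as speed-up divided by the number of systems (the slide does not state which definition was used):

```latex
E = \frac{S}{N} = \frac{90}{140} \approx 0.64
```

This is consistent with the roughly 65% efficiency quoted on the slide.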

23

PicsouGrid v3 Performance

[Figure annotations: Multi-site, Peak speed-up, Performance degradation]

24

PicsouGrid Architecture

Server/Control Node
    Provides user interface
    Instantiates network of Sub-Servers
    Allows configuration of Simulator network
    Creates “Request for Option Price” (with algorithm parameters)
    Controls Sub-Servers and aggregates/reports results
    Monitors Sub-Servers for failures and spawns new Sub-Servers if necessary

Sub-Server
    Acts as local site/cluster/system controller
    Instantiates local Simulators
    Delegates simulations in packets to Simulators
    Collects results, aggregates, and returns to Server
    Monitors Simulators for failures and spawns new Simulators if necessary

Simulator
    Computes Monte Carlo simulations for option pricing using packets
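The packet idea can be sketched in a few lines: each Simulator runs one packet of Monte Carlo paths and returns partial sums, and the Sub-Server merges the partial sums into one estimate. This is an illustration only, not PicsouGrid code. It runs sequentially in one JVM, and it assumes a European call priced under geometric Brownian motion; the class name, the `Partial` type, the seed, and all parameter values are hypothetical.

```java
import java.util.Random;

class PacketMonteCarlo {
    // Partial result of one packet: path count, sum and sum of squares
    // of the discounted payoffs. Partials can be merged in any order.
    static final class Partial {
        final long n;
        final double sum;
        final double sumSq;

        Partial(long n, double sum, double sumSq) {
            this.n = n;
            this.sum = sum;
            this.sumSq = sumSq;
        }

        Partial merge(Partial other) {
            return new Partial(n + other.n, sum + other.sum, sumSq + other.sumSq);
        }

        double mean() {
            return sum / n;
        }

        double stdError() {
            double variance = sumSq / n - mean() * mean();
            return Math.sqrt(Math.max(variance, 0.0) / n);
        }
    }

    // Simulator role: price one packet of paths for a European call,
    // assuming geometric Brownian motion for the asset price.
    static Partial simulatePacket(long seed, int paths, double s0, double strike,
                                  double rate, double sigma, double years) {
        Random rng = new Random(seed);
        double drift = (rate - 0.5 * sigma * sigma) * years;
        double vol = sigma * Math.sqrt(years);
        double discount = Math.exp(-rate * years);
        double sum = 0.0;
        double sumSq = 0.0;
        for (int i = 0; i < paths; i++) {
            double terminal = s0 * Math.exp(drift + vol * rng.nextGaussian());
            double payoff = discount * Math.max(terminal - strike, 0.0);
            sum += payoff;
            sumSq += payoff * payoff;
        }
        return new Partial(paths, sum, sumSq);
    }

    // Sub-Server role: hand out packets (sequentially here) and aggregate.
    static Partial price(int packets, int pathsPerPacket, double s0, double strike,
                         double rate, double sigma, double years) {
        Random seeds = new Random(2007L); // one independent seed per packet
        Partial total = new Partial(0, 0.0, 0.0);
        for (int p = 0; p < packets; p++) {
            total = total.merge(simulatePacket(seeds.nextLong(), pathsPerPacket,
                    s0, strike, rate, sigma, years));
        }
        return total;
    }

    public static void main(String[] args) {
        Partial result = price(20, 5000, 100.0, 100.0, 0.05, 0.2, 1.0);
        System.out.println("price ~ " + result.mean() + " +/- " + result.stdError());
    }
}
```

Returning sums rather than per-packet averages keeps the aggregation exact even when packets differ in size or a failed packet is rerun elsewhere.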

25

PicsouGrid Deployment and Operation

[Figure: deployment diagram; components: Client, Server, Sub-Servers, ProActive Workers, DB, JavaSpace virtual shared memory (to v3); message labels: reserve workers, option pricing request, MC simulation packet, heartbeat monitor, MC result]

26

PicsouGrid v5 Design Objectives

Multi-Grid
    Grid5000
    gLite/EGEE
    INRIA Sophia desktop cluster

Decoupled Workers
    Autonomous
    Independent deployment and operation
    P2P discover and acquire

Long Running, Multi-Algorithm
    Create “standing” application
    Augment (or reduce) P2P worker network based on demand
    Computational tasks specify algorithm and parameters

27

Grid Performance Monitoring and State Machines

Grid-ified distributed applications add at least three new layers of complexity compared to their serial counterparts:
    Grid interaction and management
    Local cluster interaction and management
    Distributed application code

Notoriously difficult to figure out what is going on, where, and when it is happening:
    Bottlenecks
    Hot spots
    Idle time
    Limiting factor: CPU, storage, network?
    What state is an application/task/process/system currently in?

Solution: Utilise a common state machine model for grid applications/processes

28

Layered System

Grid

Site

Cluster

Host

Core

VM

Process

29

“Proof” of layering

What I execute on a Grid5000 Submit (UI) Node:
    mysub -l nodes=30 es-bench1e6

What eventually runs on Worker Node:
    /bin/sh -c /usr/lib/oar/oarexecuser.sh /tmp/OAR_59658 30 59658 istokes-rees \
        /bin/bash ~/proc/fgrillon1.nancy.grid5000.fr/submit N script-wrapper \
        ~/bin/script-wrapper fgrillon1.nancy.grid5000.fr \
        ~/es-bench1e6

Granted, this is nothing more than good system design and separation of concerns
We are just looking at the implicit API layers of “the grid”
Universal interface: command shell, environment variables and file system

30

Abstract Recursive Process Model

Question: Is it possible to propose a recursive process model which can be applied at all layers?

Create: process description
Bind: process to the physical layer
Prepare: prepare for execution (software, stage in, config)
Execute: initiate process execution (enter next lower layer)
Complete: book keeping, stage out, clean up
Clear: wipe system, ready for next invocation

Each stage can be in a particular state:
    Ready
    Active
    Done
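The stages and states listed above can be encoded as two enums and a small stepping function. The sketch below is one possible reading, not the authors' implementation: it assumes a strictly linear walk (Ready, Active, Done within a stage, then on to the next stage) and leaves out failure, cancellation, and suspension. The class name and the dotted `layer.stage.state` label are illustrative choices.

```java
class GridProcess {
    enum Stage { CREATE, BIND, PREPARE, EXECUTE, COMPLETE, CLEAR }

    enum State { READY, ACTIVE, DONE }

    private final String layer; // e.g. "Grid", "Cluster", "Process"
    private Stage stage = Stage.CREATE;
    private State state = State.READY;

    GridProcess(String layer) {
        this.layer = layer;
    }

    // Moves one step forward: Ready -> Active -> Done within a stage,
    // then to Ready of the next stage. Returns false once Clear is Done.
    boolean advance() {
        if (state != State.DONE) {
            state = State.values()[state.ordinal() + 1];
            return true;
        }
        if (stage == Stage.CLEAR) {
            return false;
        }
        stage = Stage.values()[stage.ordinal() + 1];
        state = State.READY;
        return true;
    }

    // Dotted label suitable for tagging a timestamp on state entry.
    String label() {
        return layer + "." + stage + "." + state;
    }

    // Counts the transitions needed to walk a process from start to finish.
    static int countSteps(String layer) {
        GridProcess process = new GridProcess(layer);
        int steps = 0;
        while (process.advance()) {
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        GridProcess process = new GridProcess("Grid");
        do {
            System.out.println(System.currentTimeMillis() + " " + process.label());
        } while (process.advance());
    }
}
```

Applying the model recursively would mean that entering `EXECUTE.ACTIVE` at one layer creates a new `GridProcess` for the layer below.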

31

Grid Process State Machine

[Figure: state machine in which each of the six stages passes through the states Ready, Active, and Done; additional labels: Fail, Cancel, System, User, Suspend, Pause]

Create: create process description
Bind: bind to a particular system
Prepare: prepare system to execute process
Execute: execute process (recurse to next lower level)
Complete: tidy up system and accounting after completion of process
Clear: clear process from system

32

CREAM Job States

New LCG/EGEE Workload Management System

Can be mapped to Grid Process State Machine

This only shows one level of mapping
In practice, would apply state machine at Grid level, LRMS level, and task level

Timestamps on state entry: Layer.Stage.State

[Figure labels: Prepare, Create, Bind, Execute, Done, Failed, Cancelled, Suspend]

33

Grid5000 Stats

9 sites across France
21 clusters
17 batch systems
3138 cores
    Xeons, Opterons, Itaniums, G5

[Map of sites: Lille, Sophia, Lyon, Nancy, Bordeaux, Paris-Orsay, Toulouse, Grenoble, Rennes]

34

Characteristics of Grid5000

Private network
    Outbound Internet access possibly via ssh tunnel
Access based on ssh keys (passwordless)
Shared NFS file space at each site
Very limited data management facilities
Myrinet and Infiniband prevalent on many clusters
RENATER French research network, 2.5 to 10 Gb/s inter-site
Focus on multi-node (and multi-site) grid computing
Kadeploy provides mechanism for custom system image to be loaded before job starts

[Figure: Grid5000 site]

35

Deployment and Execution on Grid5000

Limited grid-wide (cross-site) job submission mechanisms
    In practice, submit individually at each site
    Coordinate between sites via multiple “reservation” job submissions with same reservation window
Limited data-management/staging/configuration
    Kadeploy (often too “heavy weight”)
    rsync
    Configuration wrapper scripts
Node count reservations “best effort”
    Rule of thumb: don’t expect more than 80% of requested nodes to be available when reservation starts
    Experience shows reservation start times could be delayed 30 seconds to 10 minutes

36

Experimental Setup

European Simple call/put option price
    1e6 Monte Carlo iterations
Single asset pricing reference: $t_{\text{reference}} = 67.3$ seconds
    AMD Opteron 2218 (64 bit), 2.6 GHz, 1 MB L1, 667 MHz bus (best performing core available)
Objective 1: maximize number of options priced in a fixed time window
Objective 2: maximize speed-up efficiency:

$$\text{efficiency} = \frac{n_{\text{options}} \, t_{\text{reference}}}{\sum_{i \in \text{sites}} n_{\text{cores},i} \, t_{\text{reservation},i}}$$
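The efficiency measure in Objective 2 divides the useful reference compute time (options priced times the single-core reference time) by the core time actually occupied, summed over sites. A small sketch of that arithmetic follows, checked against the figures reported later in the deck (1272 options in 85 core-hours, and 894 options in 31.3 core-hours). The class and method names are hypothetical.

```java
class SpeedupEfficiency {
    // Efficiency from per-site core counts and reservation times (seconds).
    static double efficiency(int nOptions, double tReferenceSeconds,
                             int[] coresPerSite, double[] reservationSecondsPerSite) {
        double occupiedCoreSeconds = 0.0;
        for (int i = 0; i < coresPerSite.length; i++) {
            occupiedCoreSeconds += coresPerSite[i] * reservationSecondsPerSite[i];
        }
        return nOptions * tReferenceSeconds / occupiedCoreSeconds;
    }

    // Same measure when only the total occupied core-hours are known.
    static double efficiencyFromCoreHours(int nOptions, double tReferenceSeconds,
                                          double coreHours) {
        return nOptions * tReferenceSeconds / (coreHours * 3600.0);
    }

    public static void main(String[] args) {
        // "Run now" experiment: roughly 0.28
        System.out.println(efficiencyFromCoreHours(1272, 67.3, 85.0));
        // Coordinated reservation experiment: roughly 0.53
        System.out.println(efficiencyFromCoreHours(894, 67.3, 31.3));
    }
}
```

Because the denominator counts every second a core is held, queuing delays and staggered start times lower the figure even when the pricing itself runs at full speed.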

37

“Run Now” Experiment

Make immediate request for maximum number of nodes on all Grid5000 clusters
    Price one option per acquired core
    Not really fair: Grid5000 is not a production grid
Submit to 15 clusters
    8 clusters at 6 sites completed tasks within 6 hours
    Remainder either failed or hadn’t started 24 hours later
1272 cores utilised
85 core-hours occupied
    This is the total amount of time the tasks “held” a particular core: idle time + execution time
Objective 1 (alt): 1272 options priced in “8 minute window”
Objective 2: 1272 options × 67.3 s / 85 hr = 28% efficient
Discovered various grid issues (e.g. NTP, rsync)

38

[Figure labels: Queuing, Execution, Result stage-out]

39

When everything is working

40

NTP Problems (Time Sync)

41

Unexplained slow downs (homogeneous cluster)

42

Erratic node/core startup

43

Coordinated Start with Reservation

Reservation made 12+ hours in advance
    Confirmed no other reservations for time slot
Start time at “low utilisation” point of 6:05am
    5 minutes provided for system restarts and Kadeploy re-imaging after end of reservations going to 6am
Submitted to 12 clusters, at 8 sites
    9 clusters at 7 sites ran successfully
894 cores utilised
31.3 core-hours occupied
No cluster reservation started “on time”
    Start time delays of 20s to 5.5 minutes
    Illustrates difficulty of cross-site coordinated parallel processing
Objective 1: 894 options priced in 9.5 minute window
Objective 2: 894 options × 67.3 s / 31.3 hr = 53.4% efficient
Still problems (heterogeneous clusters, NTP, rsync)

45

Intra-node timing variations

46

Core Timeline (detail)

May seem like splitting hairs, but this is important for parallel algorithms with regular communication and synchronisation points

Also, to know where latencies/inefficiencies are introduced

47

Heterogeneous clusters (hyper threading on)

48

Mis-configured timezone

49

Overall cluster benchmarks

50

Parallelism

American option pricing with “floating” exercise date is much more difficult to calculate

Two algorithms with good opportunities for parallelism are available:
    Longstaff-Schwartz (2001)
    Ibanez-Zapatero (2002)

Interesting to see what speed up can be achieved by parallel implementation

Interested in possibility of cross-site parallel computation utilising ProActive

51

Longstaff-Schwartz

52

Ibanez-Zapatero

53

Multi-Grids

Very interested in experimenting with Multi-Grid environment:
    Grid5000
    gLite/EGEE
    DAS3
    Local cluster/desktop-grid/p2p network
ProActive deploys on LCG (gLite/EGEE)
    Other ProActive applications deployed and run successfully
    VO problems in Feb/March meant PicsouGrid could not be run on LCG, so no results for ISGC!
Investigate use of HTTP-based task pools to bridge “grids”

54

Future for PicsouGrid

Many more computational finance algorithms have already been developed and need to be similarly benchmarked:
    Barrier, Basket
    American (Longstaff-Schwartz and Ibanez-Zapatero)

“Continuous” operation of option pricing, rather than “one-shot”

Incorporate dynamic node availability
Improve modularization/componentization of finance algorithms

55

Summary of Observations

Deploying parallel applications in a grid environment continues to be a challenging problem

Heterogeneity in a grid is pervasive and still hard to deal with

Understanding performance issues, hot spots, bottlenecks, wasted idle time, and synchronisation points can be aided by a grid process model

Middleware really is critical: gLite, LRMS, OAR, ProActive, etc. need to provide end users and application developers with reliable, consistent, and easy to use interface to “the grid”

56

Thank you

Questions?

https://gforge.inria.fr/projects/picsougrid/

[email protected]
