systems seminar schedule monday, 18 february, 4pm: – “new wine in old bottles” - douglas...

Systems Seminar Schedule

Monday, 18 February, 4pm:
– “New Wine in Old Bottles” - Douglas Thain

4 March:
– No seminar: Paradyn/Condor Week

Tuesday, 19 March, 3pm:
– “The Microsoft .NET System” - Mike Litzkow

Tuesday, 2 April, 3pm:
– “Condor and the Grid” - Miron Livny

Monday, 15 April, 4pm:
– “Exploiting Gray-Box Knowledge of Buffer-Cache Management” - Nathan Burnett

Monday, 29 April, 4pm:
– “Bridging the Information Gap in Storage Protocol Stacks” - Tim Denehy

New Wine in Old Bottles:

Java on Condor

Douglas Thain

University of Wisconsin

18 February 2002

Abstract

We have added Java support to Condor. I’ll tell you how it works and how to use it. There are some nifty features for end users.

Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.

A lesson: One must consider the scope of an error as well as its detail.

Disclaimer:

This is still rough around the edges.

(Someone had to go first!)

Outline

– Why Java and Condor?
– Architecture
– Initial Experience
– A Little Error Theory
– Changes for the Better
– Conclusions

Java for Scientific Computing

Java is emerging as a tool for large scale (Grande) scientific computing.
– More accessible to domain scientists.
– Simplified porting.
– Faster development, debugging.

User communities are forming:
– ACM Java Grande Conference
– The Java Grande Forum

A. Globus, E. Langhirt, M. Livny, R. Ramamurthy, M. Solomon, and S. Traugott. JavaGenes and Condor: Cycle-Scavenging genetic algorithms. ACM Conf on Java Grande, 2000.

Limitations

Java floating point and complex arithmetic do not yet satisfy all of the scientific community.
– Arguments continue between industry and academia.

Java is still slower than comparable programs in C/C++/Fortran.
– WAT compilers and JIT compilers are catching up.
– You choose: 2x slowdown vs 5x machines.

Can we really harness 5x machines while still maintaining platform independence?

Condor for Scientific Computing

Condor creates a high-throughput computing system on a community of computers.

A high-throughput computing system seeks to maximize the amount of work done over a long period of time.

A community of computers may be any collection of machines that agree to work together.

Condor Enables Ordinary Users

[Figure: a condor schedd with a queue of jobs, and many condor startd machines (each labeled with CPU and RAM) running jobs, grouped under the INFN Central Manager and the UWCS Central Manager.]

[Figure: chart of the top 10 Condor pools, axis from 0 to 800. Totals shown: 226 Condor pools, 5576 Condor hosts.]

The Hype:

Java:
– “Write once, run anywhere!”

Condor:
– “Submit once, run everywhere!”

The Grid:
– Uniform, dependable, consistent, pervasive, and inexpensive computing.

The Reality

Coupling systems is not trivial!

The easy part:
– Putting java in front of the program name.

The tricky parts:
– Java installation messes.
– Unavailable file systems.
– Distinguishing program errors from environmental errors.

[Figure: Condor architecture. Components: schedd (Job Policies, Home File System), startd (Machine Policies), MatchMaker, shadow, starter, The Job. Protocols: Matchmaking, Claiming, Activation, Execution. Process creation is labeled "Fork". Annotations: "Creates the execution environment." and "Exports the details, policy, and I/O services."]

[Figure: Java Universe architecture. Components: shadow (I/O Server, Home File System), starter (I/O Proxy), and a JVM containing the Wrapper, The Job, and the I/O Library. Process creation is labeled "Fork". Paths: Secure Remote I/O, Local I/O (Chirp), Local System Calls.]

User Interface

condor_status -java

Name           JavaVendor   Ver    State    Activity  LoadAv  Mem
aish.cs.wisc.  Sun Microsy  1.2.2  Owner    Idle      0.000   249
anfrom.cs.wis  Sun Microsy  1.2.2  Owner    Idle      0.030   249
babe.cs.wisc.  Sun Microsy  1.2.2  Claimed  Busy      1.120   123
...

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX       514    101      408          5        0           0
Total             514    101      408          5        0           0

User Interface

Submit description, given to condor_submit:

universe = java
executable = Main.class
jar_files = MyLibrary.jar
input = infile
output = outfile
arguments = Main 1 2 3
queue

I/O Interface

Input, output, and error files are automatically transferred to/from the execution site.

Any other named files may be transferred as well.

To do online I/O without transferring whole files, you must make small changes to the code:
– FileInputStream -> ChirpInputStream
– FileOutputStream -> ChirpOutputStream
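A rough illustration of how small that change can be: if the processing code is written against java.io.InputStream, only the constructor call differs. LineCounter and countLines are hypothetical names, and the ChirpInputStream constructor shown in the comment is assumed to take a file name the way FileInputStream does. An in-memory stream stands in so the sketch runs without Condor.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of Condor.
class LineCounter {
    // Works with any InputStream, so the caller decides where the bytes come from.
    static int countLines(InputStream in) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // Whole-file transfer:  InputStream in = new FileInputStream("infile");
        // Online I/O (assumed): InputStream in = new ChirpInputStream("infile");
        // Stand-in so the sketch runs anywhere:
        InputStream in =
            new ByteArrayInputStream("a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(in));
    }
}
```

The rest of the program is untouched, which is the point of adding a library on existing interfaces.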

[Figure: software layers, labeled Application, Java Standard Libraries, Java Virtual Machine, Operating System, C Standard Library, Chirp I/O Library, and JNI.]

Added a new library on existing interfaces. User must call new constructors.

Java symbols are fully qualified, so transparent replacement of classes is not possible.

Could replace native methods in the JVM, but this ties us to open-source JVMs.

Could trap real system calls, but these are complex (asynchronous, nonblocking, threaded) and may be difficult to distinguish from the JVM’s own operations.

Initial Experience

Bad news: Nearly any unexpected failure would cause the job to be returned to the user:
– Out of memory at execution site.
– Java misconfigured at execution site.
– I/O proxy can’t initialize.
– Home file system offline.

Although this was correct in some sense -- the information was true -- it was very frustrating.

Users want to know when their program fails by design (NullPointerException), but not if it fails due to the environment.

What did we do wrong?

A Little Error Theory

– Build on standard definitions from fault-tolerance and programming languages.
– Some brief examples to get the idea.
– Return to Condor and use the theory to understand our design mistakes.

Fault Tolerance Terminology

Failure
– An externally-visible deviation from specifications.

Error
– An internal data state that leads to a failure.

Fault
– An external event that creates an error.

A. Avizienis and J.C. Laprie. Dependable computing: From concepts to design diversity. Proceedings of the IEEE 74(5), May 1986.

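Example

[Figure: a client asks a server “What is sqrt(4)?” The server's state reads “Hmm, sqrt(4) is...”, then “Hmm, sqrt(9) is...”, and it replies “Answer: 3”. The diagram marks the FAULT, the ERROR, and the FAILURE.]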

Implicit errors
– The system claims to have reached a valid result, but an auditor claims it is invalid. Example: sqrt(3)==2

Explicit errors
– The system tells us it cannot complete the desired action. Example: file not found.

Escaping errors
– The system detects an error, but has no method of reporting it, so it escapes by an alternate route. Example: core dump, kernel panic.

John B. Goodenough. Exception handling: Issues and a proposed notation. CACM 18(12), December 1975.
K. Ekanadham and A. Bernstein. Some new transitions in hierarchical level structures. Operating Systems Review 12(4), 1978.
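The three kinds of error can be sketched in a few lines of Java. Everything here is hypothetical: buggySqrt is deliberately wrong so that an auditor has something to catch, and open and load exist only to show an explicit and an escaping error.

```java
import java.io.FileNotFoundException;

// Hypothetical names throughout; a sketch of the three error kinds.
class ErrorKinds {
    // Implicit error: the routine claims a valid result for every input.
    static int buggySqrt(int n) {
        return (n + 1) / 2; // deliberately wrong for most n; buggySqrt(3) == 2
    }

    // Auditor: checks whether r really is the integer square root of n.
    static boolean audit(int n, int r) {
        return r >= 0 && (long) r * r <= n && (long) (r + 1) * (r + 1) > n;
    }

    // Explicit error: the routine reports that it cannot complete the action.
    static String open(String name) throws FileNotFoundException {
        throw new FileNotFoundException(name);
    }

    // Escaping error: the signature offers no way to report the problem,
    // so it leaves by another route (here, an unchecked Error).
    static int load(int address) {
        throw new Error("backing store unavailable for address " + address);
    }
}
```

Only the auditor can expose the implicit error; the caller of buggySqrt sees nothing unusual.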

[Figure: a Program issues “load data” to the Virtual Memory System, which is backed by Physical Memory and a Backing Store. The Program reports to its Parent Process by Normal Exit or Abnormal Exit.]

– Could return a default value, but that creates an implicit error.
– Would like to return an explicit error, but a load instruction has no exit code.
– Escaping error: Tell the parent that the program could not complete.

Interface Contracts

int load( int address );

The implementor must either compute a result that conforms to the contract or cause an escaping error.

C. Hoare. An axiomatic basis for computer programming. CACM 12(10):576-580, October 1969.
B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.

Exceptions

int open( String filename ) throws FileNotFound, AccessDenied;

A language with exceptions provides more structure to the contract.

A declared exception is an explicit error.

Yet, escaping errors are still possible.

[Figure: a Program calls “open” on the Virtual File System, which is built on Memory and Disk. INTERFACE: Success, FileNotFound, AccessDenied. IMPLEMENTATION: MemoryCorrupt, DiskOffline, PigeonLost. The Program reports to its Parent Process by Normal Exit or Abnormal Exit.]
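The split between interface and implementation might be sketched as follows. The class names follow the labels in the figure, but the toy file system, its single file, and the choice of an unchecked Error for DiskOffline are assumptions made for illustration.

```java
// Hypothetical sketch of the open() contract; all names are illustrative.
class FileNotFound extends Exception {
    FileNotFound(String name) { super(name); }
}

class AccessDenied extends Exception {
    AccessDenied(String name) { super(name); }
}

// A failure of the implementation itself; deliberately absent from the contract.
class DiskOffline extends Error {
    DiskOffline(String detail) { super(detail); }
}

class VirtualFileSystem {
    private final boolean diskOnline;

    VirtualFileSystem(boolean diskOnline) {
        this.diskOnline = diskOnline;
    }

    // Interface: a descriptor, FileNotFound, or AccessDenied. Nothing else is promised.
    int open(String filename) throws FileNotFound, AccessDenied {
        if (!diskOnline) {
            // Not expressible in the contract, so it escapes past the caller.
            throw new DiskOffline("cannot reach disk while opening " + filename);
        }
        if (filename.startsWith("/secret/")) {
            throw new AccessDenied(filename);
        }
        if (!filename.equals("/data/input")) {
            throw new FileNotFound(filename);
        }
        return 3; // arbitrary descriptor for the one file this toy system holds
    }
}
```

A caller that handles FileNotFound and AccessDenied has covered the whole contract, yet DiskOffline can still pass straight through it.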

Error Scope

In order to be accepted by end users, a distributed system must be able to distinguish between errors computed by the program and errors forced upon it by the environment.

We use the term scope to draw the distinction.

The scope of an error is the portion of the system that it invalidates.

An error must be delivered to the process responsible for managing that scope.

Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
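One way to make the scope-to-handler pairing concrete is to carry the scope alongside the detail. This is only a sketch: the enum values and handler strings are taken from the rows above, while the ScopedError class and its route method are hypothetical.

```java
// Each scope names the party responsible for managing it.
enum Scope {
    FILE("calling function"),
    PROCESS("parent process"),
    MACHINE("hypervisor or operator"),
    PVM_CLUSTER("parent process");

    final String handler;

    Scope(String handler) {
        this.handler = handler;
    }
}

// Hypothetical pairing of an error's detail with the scope it invalidates.
class ScopedError {
    final String detail;
    final Scope scope;

    ScopedError(String detail, Scope scope) {
        this.detail = detail;
        this.scope = scope;
    }

    // The scope, not the detail, decides where the error is delivered.
    String route() {
        return detail + " -> " + scope.handler;
    }
}
```

For instance, new ScopedError("RPC Disconnect", Scope.PROCESS).route() names the parent process as the recipient.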

Error Detail

The detail of an error describes in phenomenological terms the cause of the error.

In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.

Suppose open returns AccessDenied...
– File is not accessible - Ok.
– Library containing ‘open’ is not accessible - Problem!

Lessons

Principle 1:
– A routine must not generate an implicit error as a result of receiving an explicit error.

Principle 2:
– An escaping error converts a potential implicit error into an explicit error at a higher level.

Principle 3:
– An escaping error must be propagated to the program that manages the error’s scope.
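Principle 1 is the easiest to violate in everyday code. The sketch below contrasts a routine that swallows an explicit error with one that keeps it explicit. All names are hypothetical, and readCount stands in for any routine that can fail.

```java
import java.io.IOException;

// Hypothetical sketch of Principle 1.
class PrincipleOne {
    // Stand-in for any routine that reports an explicit error.
    static int readCount(boolean storageAvailable) throws IOException {
        if (!storageAvailable) {
            throw new IOException("storage unavailable");
        }
        return 42;
    }

    // Violates Principle 1: the explicit error is swallowed and a plausible
    // default comes back, so the caller cannot tell 0 from a real count.
    static int countOrZero(boolean storageAvailable) {
        try {
            return readCount(storageAvailable);
        } catch (IOException e) {
            return 0;
        }
    }

    // Respects Principle 1: the explicit error stays explicit.
    static int count(boolean storageAvailable) throws IOException {
        return readCount(storageAvailable);
    }
}
```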

Java and Condor Revisited

What did we do wrong?

We focussed on error detail without considering error scope.

To fix the system, we revisited the notion of error scope throughout.

Two examples:
– JVM exit code
– I/O errors

JVM Exit Code

Detail                                 Scope            Exit Code
Program exited by completing main      Program          0
Program exited through System.exit(x)  Program          x
Exception: Null pointer.               Program          1
Exception: Out of memory.              Virtual Machine  1
Exception: Java Misconfigured.         Remote Resource  1
Exception: Home file system offline.   Local Resource   1
Exception: Program image corrupt.      Job              1

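[Figure: the Java Universe architecture (shadow, starter, JVM, Wrapper, I/O Library, The Job, Home File System, with process creation labeled "Fork"), now with a Result File. JVM Result: result of execution attempt + result of program, if any. The starter reports Starter Result + Program Result.]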
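A wrapper along these lines could record scope and detail in a file instead of relying on the exit code alone. This is a sketch under stated assumptions, not Condor's implementation: the class name, the scope labels, the mapping from Throwable types to scopes, and the one-line file format are all invented for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper; names, scope mapping, and file format are assumptions.
class ResultWrapper {
    // Classify a Throwable by the scope it invalidates (assumed mapping).
    static String scopeOf(Throwable t) {
        if (t instanceof VirtualMachineError) {
            return "VIRTUAL_MACHINE"; // e.g. OutOfMemoryError
        }
        if (t instanceof Error) {
            return "REMOTE_RESOURCE"; // assumed: other Errors come from the environment
        }
        return "PROGRAM"; // ordinary exceptions belong to the program
    }

    // Run the job, then record scope and detail where the starter can read
    // them after the JVM exits.
    static void run(Runnable job, Path resultFile) throws IOException {
        String result;
        try {
            job.run();
            result = "scope=PROGRAM detail=completed";
        } catch (Throwable t) {
            result = "scope=" + scopeOf(t) + " detail=" + t;
        }
        Files.write(resultFile, result.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the scope written down, the reader of the file can decide whether to return the job to the user or try it elsewhere, even though the exit code is 1 in both cases.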

I/O Error Scope

All Java I/O operations throw a single exception type -- IOException.

Our mistake: we converted all detected errors into IOExceptions and passed them to the program.

Makes sense for FileNotFound, but not for ProxyUnavailable or CredentialsExpired.
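A scope-aware I/O library might separate the two cases roughly as below. The Reply values, the check method, and the ProxyUnavailable class are hypothetical; the only point is that a program-scope failure becomes an IOException while an environment failure becomes an Error that the program's IOException handlers never see.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical: a failure of the I/O machinery, outside program scope.
class ProxyUnavailable extends Error {
    ProxyUnavailable(String detail) { super(detail); }
}

// Hypothetical sketch of routing I/O failures by scope.
class ScopedIo {
    enum Reply { OK, NO_SUCH_FILE, PROXY_UNREACHABLE }

    // Translate a reply from the I/O proxy into the right kind of error.
    static void check(Reply reply, String filename) throws IOException {
        switch (reply) {
            case OK:
                return;
            case NO_SUCH_FILE:
                // Program scope: the program asked for a file that is not there.
                throw new FileNotFoundException(filename);
            case PROXY_UNREACHABLE:
                // Outside program scope: not an IOException, so catch (IOException)
                // blocks in the program do not intercept it.
                throw new ProxyUnavailable("no I/O proxy while opening " + filename);
            default:
                throw new IllegalStateException("unknown reply " + reply);
        }
    }
}
```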

[Figure: starter, JVM, Wrapper, I/O Library, The Job, and Result File. JVM Result: result of execution attempt + result of program, if any. The I/O Library connects to the I/O Proxy. Labeled paths: Error Inside Program Scope and Error Outside Program Scope.]

Conclusion

We started building the Java Universe with some naive assumptions about errors.

On encountering practical difficulties, we thought more abstractly about errors and developed the notions of scope and detail.

By routing errors according to their scope, we made the system more robust and usable.

Food for Thought

There isn’t always an easy way to propagate an error to the scope handler.
– Escaping error to parent process: Raise a POSIX signal.
– Escaping error to the starter: Throw a Java Error, trapped by the Wrapper, placed in file, read after process exits.

Food for Thought

The mere use of exceptions in a program does not imply disciplined error management.

For example, throws IOException is a very vague statement about an interface.

What is an implementor allowed to throw?
– Can open() throw FileNotFound? (Probably.)
– Can read() throw FileNotFound? (Asking for trouble.)
– What about ConnectionRefused?

Food for Thought

A contract can govern more than simply the interface specification. Consider this self-cleaning program:

fd = open("file");
unlink("file");
close(fd);

Works on UNIX, fails on WinNT. Can an interface (code+docs) really state all the necessary semantic information? Should it?

Deployment

As of February 14th, the Java Universe is running on 515 RedHat 7.2 machines.

Will be rolled out as part of Condor 6.3.2 on all platforms in the regular release schedule.

– Sun JDK 1.2.2 on UNIX machines.
– Sun JDK 1.3.2 on WinNT machines.

“Is the Java Universe available on my machine?”
– condor_status -java

[Figure labels: c2 cluster, tux lab, istat, skywalker.cs.wisc.edu]

Acknowledgements

Although we may take credit (or blame) for the most recent changes, the Condor architecture has dealt with errors for many years. Much credit goes to the core designers, esp. Mike Litzkow, Todd Tannenbaum, and Derek Wright.

More Info:

The Condor Project:
– http://www.cs.wisc.edu/condor

These slides:
– http://www.cs.wisc.edu/~thain

Douglas Thain
– thain@cs.wisc.edu

Questions now?
