Resilience at Exascale. Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory; Professor, Dept. of Computer Science, UIUC

Upload: marc-snir

Post on 20-May-2015


TRANSCRIPT

Page 1: Resilience at exascale

Resilience at Exascale

Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory; Professor, Dept. of Computer Science, UIUC

Page 2: Resilience at exascale

The Problem

•  "Resilience is the black swan of the exascale program"
•  DOE & DoD commissioned several reports
  –  Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
  –  U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
  –  ICIS workshop on "addressing failures in exascale computing" (report forthcoming)
•  Talk:
  –  Discuss the little we understand
  –  Discuss how we could learn more
  –  Seek your feedback on the report's content

2  

Page 3: Resilience at exascale

WHERE WE ARE TODAY

3  

Page 4: Resilience at exascale

Failures Today

•  Latest large systematic study (Schroeder & Gibson) published in 2005 -- quite obsolete
•  Failures per node per year: 2-20
•  Root cause of failures
  –  Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
•  Mean Time to Repair: 3-24 hours
•  Current anecdotal numbers:
  –  Application crash: >1 day
  –  Global system crash: >1 week
  –  Application checkpoint/restart: 15-20 minutes (checkpoint often "free")
  –  System restart: >1 hour
  –  Soft HW errors suspected

4  

Page 5: Resilience at exascale

Recommendation

•  Study in a systematic manner current failure types and rates
  –  Using error logs at major DOE labs
  –  Pushing for a common ontology
•  Spend time hunting for soft HW errors
•  Study causes of errors (?)
  –  Most HW faults (>99%) have no effect (processors are hugely inefficient?)
  –  The common effect of soft HW bit flips is SW failure
  –  HW faults can (often?) be due to design bugs
  –  Vendors are cagey
•  Time for a consortium effort?

5  

Page 6: Resilience at exascale

Current Error-Handling

•  Application: global checkpoint & global restart
•  System:
  –  Repair persistent state (file system, databases)
  –  Clean-slate restart for everything else
•  Quick analysis:
  –  Assume failures have a Poisson distribution; checkpoint time C=1; recover+restart time = R; MTBF = M
  –  Optimal checkpoint interval τ satisfies
  –  System utilization U is

6  

e^(−τ/M) = (M − τ + 1)/M

U = (M − R − τ + 1)/M
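The two relations on this slide can be solved numerically; a minimal sketch (bisection solver, time measured in units of C = 1; the function names are ours):

```python
import math

def optimal_tau(M):
    """Solve e^(-tau/M) = (M - tau + 1)/M for the optimal checkpoint
    interval tau by bisection (time in units of the checkpoint cost C)."""
    f = lambda t: math.exp(-t / M) - (M - t + 1) / M
    lo, hi = 0.0, M  # f(0) < 0 and f(M) > 0 for M > 1, so a root lies between
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def utilization(M, R):
    """System utilization U = (M - R - tau + 1)/M at the optimal tau."""
    return (M - R - optimal_tau(M) + 1) / M
```

With C = R = 15 minutes as the time unit and M = 100 (MTBF of 25 hours), this lands near the 85% utilization quoted on the next slide.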

Page 7: Resilience at exascale

Utilization, assuming R=1

•  E.g.
  –  C = R = 15 mins (=1)
  –  MTBF = 25 hours (=100)
  –  U ≈ 85%

7  

Page 8: Resilience at exascale

Utilization as Function of MTBF and R

8  

•  E.g.
  –  C = 15 mins (=1)
  –  R = 1 hour (=4)
  –  MTBF = 25 hours (=100)
•  U ≈ 80%

Page 9: Resilience at exascale

Projecting Ahead

•  "Comfort zone:"
  –  Checkpoint time < 1% of MTBF
  –  Repair time < 5% of MTBF
•  Assume MTBF = 1 hour
  –  Checkpoint time ≈ 0.5 minute
  –  Repair time ≈ 3 minutes
•  Is this doable?
  –  Yes, if done in memory
  –  E.g., using "RAID" techniques

 

9  

Page 10: Resilience at exascale

Exascale Design Point

Systems                      2012 BG/Q Computer        2020-2024              Difference Today & 2019
System peak                  20 Pflop/s                1 Eflop/s              O(100)
Power                        8.6 MW                    ~20 MW
System memory                1.6 PB (16*96*1024)       32-64 PB               O(10)
Node performance             205 GF/s (16*1.6GHz*8)    1.2 or 15 TF/s         O(10)-O(100)
Node memory BW               42.6 GB/s                 2-4 TB/s               O(1000)
Node concurrency             64 threads                O(1k) or 10k           O(100)-O(1000)
Total node interconnect BW   20 GB/s                   200-400 GB/s           O(10)
System size (nodes)          98,304 (96*1024)          O(100,000) or O(1M)    O(100)-O(1000)
Total concurrency            5.97 M                    O(billion)             O(1,000)
MTTI                         4 days                    O(<1 day)              -O(10)

Both price and power envelopes may be too aggressive!

Page 11: Resilience at exascale

Time to checkpoint

•  Assume "RAID 5" across memories, checkpoint size ~50% of memory size
  –  Checkpoint time ≈ time to transfer 50% of memory to another node: a few seconds!
  –  Memory overhead ~50%
  –  Energy overhead small with NVRAM
  –  (But need many write cycles)
  –  We are in the comfort zone (re checkpoint)

11  

•  How about recovery?
•  Time to restart application same as time to checkpoint
•  Problem: time to recover if system failed
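A back-of-the-envelope version of the "few seconds" claim, using numbers assumed from the design-point table (64 PB system memory, 100,000 nodes, 200 GB/s node interconnect; illustrative, not measured):

```python
# Rough in-memory checkpoint-time estimate for a "RAID 5 across memories"
# scheme; all inputs are assumptions from the exascale design-point table.
system_memory = 64e15          # bytes (64 PB)
nodes = 100_000
node_bw = 200e9                # bytes/s transferable to a partner node

per_node_memory = system_memory / nodes        # 640 GB per node
checkpoint_bytes = 0.5 * per_node_memory       # checkpoint ~50% of memory
t_checkpoint = checkpoint_bytes / node_bw
print(f"checkpoint time ~ {t_checkpoint:.1f} s")
```

The result is on the order of a couple of seconds, which is what puts in-memory checkpointing inside the comfort zone.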

Page 12: Resilience at exascale

Key Assumptions:

1.  Errors are detected (before the erroneous checkpoint is committed)
2.  System failures are rare (<< 1 per day) or recovery from them is very fast (minutes)

12  

Page 13: Resilience at exascale

Hardware Will Fail More Frequently

•  More, smaller transistors
•  Near-threshold logic
☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging

13  

Page 14: Resilience at exascale

Hardware Error Detection: Assumptions

14  

3.4.1 Compute Node Soft Errors and Failures

Soft errors and failures in the compute node (processor and memory only; network, power, and cooling are discussed later in this section) are a result of events that are entirely external to the system and cannot be replicated. Furthermore, soft faults are transient in nature and leave no lasting impact on hardware. By far the most significant source of soft faults is energetic particles that interact with the silicon substrate and either flip the state of a storage element or disrupt the operation of a combinational logic circuit. The two common sources of particle-strike faults are alpha particles that originate within the package and high-energy neutrons. When a high-energy neutron interacts with the silicon die, it creates a stream of secondary charged particles. These charged particles then further interact with the semiconductor material, freeing electron-hole pairs. If the charged particle creates the electron-hole pairs within the active region of a transistor, a current pulse is formed. This current pulse can directly change the state of a storage device or can manifest as a wrong value at the end of a combinational logic chain. Alpha particles are charged and may directly create electron-hole pairs.

To model the impact a particle strike has on a compute node, we model the effect on each node component separately, namely: SRAM, latches, combinational logic, DRAM, and NV-RAM. We then determine rough estimates for the number of units of each component within the node. We use this estimate to provide very rough, order-of-magnitude fault rates for the compute node. We also briefly mention how such faults are handled in processors today and discuss how advances in process technology are expected to affect these soft faults. We make projections for the impact of particle-strike soft errors on a future 11nm node, as well as present an estimate of the overhead/error-rate tradeoffs at the hardware level. The estimates are based on the models below and on some assumptions about the components of a node, as shown in Table 2.

A few important caveats about the models and projections:

• The numbers summarized in the table below do not include hard errors, including intermittent hard errors. We expect intermittent hard errors and failures in hardware to be a significant contributor to software-visible errors and failures.

• We do not have access to good models for the susceptibility of near-threshold circuits and do not consider such designs.

• We give only a rough, order-of-magnitude at best, estimate; many important factors remain unknown with respect to the 11nm technology node.

Table 2: Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm.

                                                  45nm                    11nm
Cores                                             8                       128
Scattered latches per core                        200,000                 200,000
Scattered latches in uncore (relative to cores)   1.25/√ncores = 0.44     1.25/√ncores = 0.11
FIT per latch                                     10^-1                   10^-1
Arrays per core (MB)                              1                       1
FIT per SRAM cell                                 10^-4                   10^-4
Logic FIT / latch FIT                             0.1-0.5                 0.1-0.5
DRAM FIT (per node)                               50                      50

Page 15: Resilience at exascale

Hardware Error Detection: Analysis

15  

Array interleaving and SECDED (baseline)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       20000       1       1000
Scattered latches      200       4000       N/A      N/A         20      400
Combinational logic    20        400        N/A      N/A         0       4
DRAM                   50        50         0.5      0.5         0.005   0.005
Total                  1000-5000 100000     10-100   5000-20000  10-50   500-5000

Array interleaving and >SECDED (11nm overhead: ~1% area and <5% power)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       1000        1       5
Scattered latches      200       4000       N/A      N/A         20      400
Combinational logic    20        400        N/A      N/A         0.2     5
DRAM                   50        50         0.5      0.5         0.005   0.005
Total                  1500-6500 100000     10-50    500-5000    10-50   100-500

Array interleaving and >SECDED + latch parity (45nm overhead ~10%; 11nm overhead: ~20% area and ~25% power)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       1000        1       5
Scattered latches      200       4000       20       400         0.01    0.5
Combinational logic    20        400        N/A      N/A         0.2     5
DRAM                   0         0          0.1      0.0         0.100   0.001
Total                  1500-6500 100000     25-100   2000-10000  1       5-20

Table 3: Summary of per-processor particle-strike soft error characteristics within a compute node (sea level, equator). Note that other sources of transient faults cannot be ignored.

SRAM. Large SRAM arrays dominate the raw particle-strike fault rate of a processor silicon die. When a particle strike releases charge within an active region of a transistor in an SRAM cell, the charge collected may exceed the charge required to change the value stored in the cell, causing a single-event upset (SEU). An SEU may impact a single SRAM cell or may change the values of multiple adjacent cells. Such multi-cell upsets (MCUs) are also called burst errors. A reasonable ballpark number for the SRAM particle-strike upset rate is 1 upset every 10^7 hours for 1 Mb of capacity, which is a rate of 10^-4 FIT/bit. Our best estimates indicate that the SEU rate for SRAM will remain roughly constant as technology scales. While many complex phenomena impact susceptibility, the current roadmap of changes to devices, operating voltage, and scale does not point to extreme changes in susceptibility. What is expected to change is the distribution of MCUs, with a single upset more likely to affect longer bursts of cells at smaller scales.

Because the raw FIT/chip from SRAM is high (estimated at roughly 0.5 upsets per year per chip, or multiple upsets an hour in a large-scale HPC system), large arrays are protected with error detection and error correction capabilities. An approach in use today is a combination of physical word interleaving coupled with an error detection code (EDC) or with error checking and correcting (ECC) mechanisms. Given the distribution of MCUs today, 4-way interleaving with SECDED capabilities per array line is sufficient. Stronger capabilities will likely be needed in the future, but their energy and area overhead is expected to be low (see Table 3). Note that our estimates assume that 4-bit or longer bursts increase from 1% of all SEUs to 10% or higher between the 45nm and 11nm technology nodes, and that the rate of bursts of 8 bits or longer increases from 0.01% of all SEUs to 1% of all SEUs.

Note that alternative storage technology with much lower particle-strike error rates is possible. Some current processors use embedded DRAM for large arrays and there is a possibility that future processors will use on-chip arrays of non-volatile storage. Embedded DRAM has a 100 times or more lower error rate than SRAM. Non-volatile storage cells are immune to particle strikes, but do display some soft error fault


Page 16: Resilience at exascale

Summary of (Rough) Analysis

•  If no new technology is deployed, can have up to one undetected error per hour
•  With additional circuitry, could get down to one undetected error per 100-1,000 hours (week to months)
  –  The cost could be as high as 20% additional circuits and 25% additional power
  –  Main problems are in combinational logic and latches
  –  The price could be significantly higher (small market for high-availability servers)
•  Need software error detection as an option!
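These rates follow from the per-node FIT numbers in Table 3 (FIT = failures in 10^9 device-hours); a quick conversion, assuming a 100,000-node machine (the node count is an assumption from the design-point table):

```python
def undetected_per_hour(fit_per_node, nodes):
    """Convert a per-node FIT rate (failures per 10^9 device-hours)
    into a system-wide undetected-errors-per-hour rate."""
    return fit_per_node * nodes / 1e9

nodes = 100_000  # assumed exascale node count
# Baseline 11nm undetected-error (UE) FIT per node is 500-5000 (Table 3);
# with the extra circuitry (latch parity) it drops to 5-20.
for fit in (5000, 20):
    rate = undetected_per_hour(fit, nodes)
    print(f"{fit} FIT/node -> one undetected error every {1/rate:.0f} h")
```

The upper baseline figure gives roughly one undetected error every couple of hours; the hardened figure stretches that to hundreds of hours, matching the orders of magnitude on the slide.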

16  

Page 17: Resilience at exascale

Application-Level Data Error Detection

•  Check checkpoint for correctness before it is committed.
  –  Look for outliers (assume smooth fields) -- handles high-order bit errors
  –  Use robust solvers -- handles low-order bit errors
    •  May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
    •  May not work if outlier is rapidly smoothed (error propagation)
  –  Check for global invariants (e.g., energy preservation)
•  Ignore error
•  Duplicate computation of critical variables
•  Can we reduce checkpoint sizes (or avoid them altogether)?
  –  Tradeoff: how much is saved vs. how often data is checked (big/small transactions)
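The outlier check on smooth fields can be sketched as a neighbor-deviation test. This is an illustrative sketch, not the method from the talk; the function name and the tolerance `tol` are assumed, problem-dependent choices:

```python
def find_outliers(field, tol=1e3):
    """Flag points whose value deviates wildly from the average of their
    neighbors: a cheap smoothness check that catches high-order bit flips.
    `tol` is a relative-deviation threshold (problem-dependent)."""
    bad = []
    for i in range(1, len(field) - 1):
        neighbor_avg = 0.5 * (field[i - 1] + field[i + 1])
        if abs(field[i] - neighbor_avg) > tol * max(1.0, abs(neighbor_avg)):
            bad.append(i)
    return bad
```

A high-order bit flip turns a value into a wild outlier that this test catches; a low-order flip changes the value only slightly and must instead be absorbed by a robust solver, as the slide notes.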

 17  

Page 18: Resilience at exascale

How About Control Errors?

•  Code is corrupted, jump to wrong address, etc.
  –  Rarer (more data state than control state)
  –  More likely to cause a fatal exception
  –  Easier to protect against

 

18  

Page 19: Resilience at exascale

How About System Failures (Due to Hardware)?

•  Kernel data corrupted
  –  Page tables, routing tables, file system metadata, etc.
•  Hard to understand and diagnose
  –  Breaks abstraction layers
•  Makes sense to avoid such failures
  –  Duplicate system computations that affect kernel state (price can be tolerated for HPC)
  –  Use transactional methods
  –  Harden kernel data
    •  Highly reliable DRAM
    •  Remotely accessible NVRAM
  –  Predict and avoid failures

19  

Page 20: Resilience at exascale

Failure Prediction from Event Logs

20  

types of signals: periodic, noise, and silent. Figure 1 presents the three types and the possible cause for each type.

We observed that a fault trigger in the system does not have a consistent representation in the logs. For example, a memory failure will cause the faulty module to generate a large number of messages. Conversely, in case of a node crash the error will be characterized by a lack of notifications. Data mining algorithms in general assume that faults manifest themselves in the same way and consequently fail to handle more than one type of behavior.

For example, even though silent signals represent the majority of event types, data mining algorithms fail to extract the correlation between them and other types of signals. This affects fault prediction in both the total number of faults seen by the method and the time delay offered between the prediction and the actual occurrence of the fault.

Signal analysis methods can handle all three signal types, and thus provide a larger set of correlations that can be used for prediction. However, data mining algorithms are better suited to characterizing correlations between different high-dimensionality sets than the cross-correlation function offered by signal analysis. Data mining is a powerful technology that converts raw data into an understandable and actionable form, which can then be used to predict future trends or provide meaning to historical events.

Additionally, outlier detection has a rich research history incorporating both statistical and data mining methods for different types of datasets. Moreover, these methods are able to implicitly adapt to changes in the dataset and to apply threshold-based distance measures separating outliers from the bulk of good observations. In this paper, we combine the advantages of both methods in order to offer a hybrid approach capable of characterizing different behaviors given by events generated by an HPC system and providing an adaptive forecasting method by using the latest data mining techniques.

In the following sections we present the methodology used for preprocessing the log files and extracting the signals, and then we introduce the novel hybrid method that combines signal analysis concepts with data mining techniques for outlier detection and correlation extraction. An overview of the methodology is presented in Figure 2.

A. Preprocessing

Log files generated by large HPC systems contain millions of message lines, making their manual analysis impossible. Moreover, the format in which messages are generated is not structured and differs between different systems and sometimes even between different components of the same machine. In order to bring structure to our analysis, we extract the description of all generated messages. These descriptions represent the characterization of different events used by the system. Also, as software changes with versions, bug fixes, or driver updates, these descriptions are modified to reflect the system's output at all times.

Fig. 2. Methodology overview of the hybrid approach

For this, we performed an initial pass over the log files to identify frequently occurring messages with similar syntactic patterns. Specifically, we use the Hierarchical Event Log Organizer [15] on the raw logs, resulting in a list of message templates. These templates represent regular expressions that describe a set of syntactically related messages and define different events in a system. In the on-line phase, we use HELO on-line to keep the set of templates updated and relevant to the output of the system.

For the rest of the paper, we analyze the generated events separately by extracting a signal for each of them and characterizing their behavior and the correlations between them. Figure 1 presents one template or event type for each type of signal. First, we extract the signal for each event type by sampling the number of event occurrences for every time unit, and afterwards we use wavelets and filtering to characterize the normal behavior of each of them. In our experiments, we use a sampling rate of 10 seconds for all signals. More details about this step can be found in [4].

In the on-line phase, the signal creation module simply concatenates the existing signals with the information received from the input stream of events. For optimization purposes, we only keep the last two months in the on-line module, since execution time is an important factor in this phase. The outlier monitor and the prediction system are applied on this trimmed and updated set of signals.

B. Analysis Modules

1) Outlier detection: All analysis modules are novel hybrid modules that combine data mining techniques with the previously extracted set of signals and their characterization. Since the offline phase is not run in real time and the execution time is not constrained, we did not optimize this step. For outlier detection in the on-line phase, we use as input the adapted set of signals and apply a simple data-cleaning method for identifying the erroneous data points.

We implement this step as a filtering signal analysis module so that it can be easily inserted between signal analysis

Gainaru, Cappello, Snir, Kramer (SC12)

Use a combination of signal analysis (to identify outliers) and data mining (to find correlations)
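The preprocessing step described above (one count signal per message template, sampled every 10 seconds) can be sketched as follows; the function name and input format are illustrative assumptions, and only the 10-second sampling period comes from the paper:

```python
from collections import Counter

def event_signal(timestamps, sample_period=10.0):
    """Turn the timestamps (in seconds) of one message template into a
    count-per-bin signal, one bin per sampling period, as in the paper's
    preprocessing step."""
    if not timestamps:
        return []
    bins = Counter(int(t // sample_period) for t in timestamps)
    return [bins.get(i, 0) for i in range(max(bins) + 1)]
```

Wavelets and filtering would then be applied to such signals to characterize each event type's normal behavior.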

Page 21: Resilience at exascale

Can Hardware Failure Be Predicted?

21  

Fig. 7. Percentage of sequences propagating on different racks, midplanes and nodes

Fig. 8. Prediction time window

propagate to multiple locations. For example the sequence:
  can not get assembly information for node card
  link card power module * is not accessible
  no power module * found on link card
gives information about a node card that is not fully functional. Events marked as "severe" and "failure" occur after around one hour and report that the link card module is not accessible from the same midplane and that the link card is not found. The sequence is generated by the same node for all its occurrences in the log.

For 75% of correlations that do not propagate, the prediction system does not need to worry about finding the right location that will be affected by the failure. However, for the other 25% that propagate, a wrong prediction will lead to a decrease in both precision and recall. We analyzed this a little further and observed that for most propagation sequences the initiating node (the one where the first symptom occurs) is included in the set of nodes affected by the failure. This leads us to believe that the recall of the prediction system will be more affected by the location predictor than its precision.

VI. DISSECTING PREDICTION

Figure 8 shows an overview of the prediction process. The observation window is used for the outlier detection. The analysis time represents the overhead of our method in making a prediction: the execution time for detecting the outlier, triggering a correlation sequence, and finding the corresponding locations. The prediction window is the time delay until the predicted event will occur in the system. The prediction window starts right after the observation point but is visible only at the end of the analysis time.

In the next section we analyze the prediction based on the visible prediction window and then propose an analytical model for the impact of our results on checkpointing strategies.

Prediction method   Precision   Recall   Seq used      Pred. failures
ELSA hybrid         91.2%       45.8%    62 (96.8%)    603
ELSA signal         88.1%       40.5%    117 (92.8%)   534
Data mining         91.9%       15.7%    39 (95.1%)    207

TABLE II. PERCENTAGE WASTE IMPROVEMENT IN CHECKPOINTING STRATEGIES

The metrics used for evaluating prediction performance are precision and recall:

• Precision is the fraction of failure predictions that turn out to be correct.

• Recall is the fraction of failures that are predicted.

A. Analysis

In the on-line phase the analysis is composed of the outlier detection and the module that triggers the predictions after inspecting the correlation chains. We computed the execution time for different regimes: during the normal execution of the system and during the periods that put the most stress on the analysis, specifically periods with bursts of messages. If the incoming event type is already in an active correlation list, we do not investigate it further since it will not give us additional information.

The systems we analyzed generate on average 5 messages per second, and during bursts of messages the logs present around 100 messages per second. The analysis window is negligible in the first case and around 2.5 seconds in the second. The worst case seen for these systems was 8.43 seconds, during an NFS failure on Mercury. By taking this analysis window into consideration we examined how many correlation chains are actually used for predicting failures and which failures we are able to detect before they occur.

Our previous work showed 43% recall and 93% precision for the LANL system by using a purely signal analysis approach. However, at that point, we were not interested in predicting the location where the fault might occur. In this paper, we focus on both the location and the prediction window. We compute the results only for the BlueGene/L systems and guided our metrics based on the severity field offered by the system.

We analyzed the number of sequences found with our initial signal analysis approach, the data mining algorithm described in [29], and the present hybrid method. Signal analysis gives a larger number of sequences, in general of small length, making the analysis window higher. Also, the on-line outlier detection puts extra stress on the analysis, making the analysis window exceed 30 seconds when the system experiences bursts. Due to our data mining extraction of multi-event correlations we were able to keep only the most frequent subset, making the on-line analysis work on a much lighter correlation set. At the other extreme, the data mining approach loses correlations between signals of different types, so even if the correlation set is much smaller than our hybrid method's, the false negative count is higher.

For example, if 25% of errors are predicted, the new mttf is 4·mttf/3. The rest of the failures are predicted events and have a mean time between them of mttf/N seconds.

By applying the new mttf for the un-predicted failures to equation (2), the new optimal checkpoint interval becomes

    T_optimum = sqrt(2C · mttf / (1 − N))    (3)

The first two terms from equation (1) need to change to consider only the un-predicted failures, since for all the others preventive actions will be taken. By adding the first two terms and incorporating the value for the checkpoint interval from equation (3), the minimum waste becomes:

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf    (4)

The last term from equation (1) will not change, since for all failures, both predicted and un-predicted, the application needs to be restarted. In addition to the waste from (4), each time an error is predicted the application will take a checkpoint, and it will waste the execution time between when this checkpoint is taken and the occurrence of the failure. This value depends on the system the application is running on and can range from a few seconds to even one hour. However, for the systems we analyzed the time delay is in general very low, and for our model we consider it negligible compared to the checkpointing time. We add the waste of C seconds for each predicted failure, which happens every mttf/N seconds. After adding this waste, equation (4) becomes:

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf + CN/mttf    (5)

In the ideal case, when N=1, the minimum waste is equal to the time to checkpoint right before every failure and the time to restart after every failure. The formula assumes perfect precision. In case the precision is P, the waste value must also take into consideration the cases when the prediction is wrong. The predicted faults happen every mttf/N seconds and they represent P of the total predictions. This means that the remaining (1 − P) false-positive predictions will happen every (P/(1 − P)) · (mttf/N) seconds. Each time a false positive is predicted, a checkpoint is taken that must be added to the total waste from equation (5):

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf + CN/mttf + CN(1 − P)/(P · mttf)    (6)

As an example, we consider the values used by [34] to characterize current systems: R = 5, D = 1, in minutes, and study two values for the time to checkpoint: C = 1 minute and, from [25], C = 10 seconds. We computed the gain from using the prediction offered by our hybrid method with different precision and recall values and for different MTTFs. Table III

C       Precision   Recall   MTTF for the whole system   Waste gain
1 min   92          20       one day                     9.13%
1 min   92          36       one day                     17.33%
10 s    92          36       one day                     12.09%
10 s    92          45       one day                     15.63%
1 min   92          50       5 h                         21.74%
10 s    92          65       5 h                         24.78%

TABLE III. PERCENTAGE WASTE IMPROVEMENT IN CHECKPOINTING STRATEGIES

presents the results. The first four cases present numbers from real systems and checkpointing strategies. Interestingly, for future systems with an MTTF of 5h, if the prediction can provide a recall over 50% then the wasted time decreases by more than 20%. In the future, we plan to combine a checkpointing strategy with our prediction and study its effectiveness in real HPC systems.
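The waste model can be checked numerically. A sketch, assuming equation (1) is the standard no-prediction waste sqrt(2C/mttf) + (R+D)/mttf (the function names are ours); it reproduces the table's numbers:

```python
import math

def waste(C, R, D, mttf, N=0.0, P=1.0):
    """Minimum waste fraction from equation (6); N = recall, P = precision.
    With N = 0 and P = 1 this reduces to the no-prediction waste."""
    return (math.sqrt(2 * C * (1 - N) / mttf)   # checkpoints for unpredicted failures
            + (R + D) / mttf                    # restart + downtime for every failure
            + C * N / mttf                      # checkpoint before each predicted failure
            + C * N * (1 - P) / (P * mttf))     # checkpoints for false positives

def waste_gain(C, R, D, mttf, N, P):
    """Percentage improvement over the no-prediction baseline."""
    w0 = waste(C, R, D, mttf)
    return 100 * (w0 - waste(C, R, D, mttf, N, P)) / w0

# First row of Table III: C = 1 min, MTTF = one day, P = 92%, N = 20%
print(round(waste_gain(1, 5, 1, 1440, 0.20, 0.92), 2))  # -> 9.13, as in the table
```

The same call with mttf = 300 (5 hours) and N = 0.50 gives the table's 21.74%, so the displayed equations are internally consistent with the reported gains.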

VII. CONCLUSION

This paper investigates a novel way of analyzing log files from large-scale systems, by combining two different analysis techniques, data mining and signal processing, and using the advantages given by both. We use signal analysis concepts for shaping the normal behaviour of each event type and of the whole system and characterizing the way faults affect them. This way the models we use are more realistic, in that they take into account the different behaviour of the events during failure-free execution and when failures occur. At the same time, we use data mining algorithms for analyzing the correlations between these behaviors, since these algorithms prove themselves better suited to characterizing the interactions between different high-dimensionality sets than the cross-correlation function offered by signal analysis.

In our experiments we show that a more realistic model, like the one obtained with the hybrid method, influences the prediction results and, in the end, improves the efficacy of fault tolerance algorithms. We investigated the lag time between the prediction moment and the time of occurrence of the actual failure, taking into consideration the analysis time, and concluded that the proposed model could allow proactive actions to be taken. Moreover, since the location of an error is an important part of a prediction system, we included location analysis in our prediction and studied its impact on the results. We will focus in the future on a more detailed analysis of different error types for which our system has a low recall. Also, we plan to study, to a wider extent, the practical way the prediction system influences current fault tolerance mechanisms.

ACKNOWLEDGMENT

This work was supported in part by the DoE 9J-30281-0008A grant, and by the INRIA-Illinois Joint Laboratory for Petascale Computing.

Migrating processes when node failure is predicted can significantly improve utilization

Page 22: Resilience at exascale

How About Software Bugs?

•  Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
  –  Hard to understand concurrent code (cannot comprehend more than 10 concurrent actors)
  –  Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
  –  Hard to test for performance bugs
•  Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
  –  Problem will worsen as we scale up -- performance bugs become more frequent
•  Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
  –  Make system code more deterministic

22  

Page 23: Resilience at exascale

System Recovery

•  Local failures (e.g., node kernel crashed) are not a major problem -- can replace
•  Global failures (e.g., global file system crashed) are the hardest problem -- need to avoid
•  Issue: fault containment
  –  Local hardware failure corrupts global state
  –  Localized hardware error causes global performance bugs

23  

Page 24: Resilience at exascale

Quis Custodiet Ipsos Custodes?

•  Who  watches  the  watchmen?  

•  Need robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
  –  Current approach of out-of-band monitoring and control is too restricted

24  

Page 25: Resilience at exascale

Summary

•  Resilience is a major problem in the exascale era
•  Major to-dos:
  –  Need to understand much better the failures in current systems and future trends
  –  Need to develop a good application error detection methodology; error correction is desirable, but less critical
  –  Need to significantly reduce system errors and reduce system recovery time
  –  Need to develop robust infrastructure for error handling

25  

Page 26: Resilience at exascale

26  

The End