Resilience at Exascale. Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory; Professor, Dept. of Computer Science, UIUC

Upload: marc-snir

Post on 20-May-2015


TRANSCRIPT

Page 1: Resilience at exascale

Resilience at Exascale

Marc Snir, Director, Mathematics and Computer Science Division, Argonne National Laboratory; Professor, Dept. of Computer Science, UIUC

Page 2: Resilience at exascale

The Problem

•  "Resilience is the black swan of the exascale program"
•  DOE & DoD commissioned several reports
  –  Inter-Agency Workshop on HPC Resilience at Extreme Scale, http://institute.lanl.gov/resilience/docs/Inter-AgencyResilienceReport.pdf (Feb 2012)
  –  U.S. Department of Energy Fault Management Workshop, http://shadow.dyndns.info/publications/geist12department.pdf (June 2012)
  –  ICIS workshop on "addressing failures in exascale computing" (report forthcoming)
•  Talk:
  –  Discuss the little we understand
  –  Discuss how we could learn more
  –  Seek your feedback on the report's content

2  

Page 3: Resilience at exascale

WHERE WE ARE TODAY

3  

Page 4: Resilience at exascale

Failures Today

•  Latest large systematic study (Schroeder & Gibson) published in 2005 -- quite obsolete
•  Failures per node per year: 2-20
•  Root cause of failures
  –  Hardware (30%-60%), SW (5%-25%), Unknown (20%-30%)
•  Mean Time to Repair: 3-24 hours
•  Current anecdotal numbers:
  –  Application crash: >1 day
  –  Global system crash: >1 week
  –  Application checkpoint/restart: 15-20 minutes (checkpoint often "free")
  –  System restart: >1 hour
  –  Soft HW errors suspected

4  

Page 5: Resilience at exascale

Recommendation

•  Study in a systematic manner current failure types and rates
  –  Using error logs at major DOE labs
  –  Pushing for a common ontology
•  Spend time hunting for soft HW errors
•  Study causes of errors (?)
  –  Most HW faults (>99%) have no effect (processors are hugely inefficient?)
  –  The common effect of soft HW bit flips is SW failure
  –  HW faults can (often?) be due to design bugs
  –  Vendors are cagey
•  Time for a consortium effort?

5  

Page 6: Resilience at exascale

Current Error-Handling

•  Application: global checkpoint & global restart
•  System:
  –  Repair persistent state (file system, databases)
  –  Clean-slate restart for everything else
•  Quick analysis:
  –  Assume failures have a Poisson distribution; checkpoint time C=1; recover+restart time = R; MTBF = M
  –  Optimal checkpoint interval τ satisfies
  –  System utilization U is

6  

e^(−τ/M) = (M − τ + 1)/M

U = (M − R − τ + 1)/M
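The two relations on this slide can be solved numerically; a minimal sketch (bisection solver, time measured in units of C = 1; the function names are ours):

```python
import math

def optimal_tau(M):
    """Solve e^(-tau/M) = (M - tau + 1)/M for the optimal checkpoint
    interval tau by bisection (time in units of the checkpoint cost C)."""
    f = lambda t: math.exp(-t / M) - (M - t + 1) / M
    lo, hi = 0.0, M  # f(0) < 0 and f(M) > 0 for M > 1, so a root lies between
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def utilization(M, R):
    """System utilization U = (M - R - tau + 1)/M at the optimal tau."""
    return (M - R - optimal_tau(M) + 1) / M
```

With C = R = 15 minutes as the time unit and M = 100 (MTBF of 25 hours), this lands near the 85% utilization quoted on the next slide.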

Page 7: Resilience at exascale

Utilization, assuming R=1

•  E.g.
  –  C = R = 15 mins (=1)
  –  MTBF = 25 hours (=100)
  –  U ≈ 85%

7  

Page 8: Resilience at exascale

Utilization as Function of MTBF and R

8  

•  E.g.
  –  C = 15 mins (=1)
  –  R = 1 hour (=4)
  –  MTBF = 25 hours (=100)
•  U ≈ 80%

Page 9: Resilience at exascale

Projecting Ahead

•  "Comfort zone:"
  –  Checkpoint time < 1% of MTBF
  –  Repair time < 5% of MTBF
•  Assume MTBF = 1 hour
  –  Checkpoint time ≈ 0.5 minute
  –  Repair time ≈ 3 minutes
•  Is this doable?
  –  Yes, if done in memory
  –  E.g., using "RAID" techniques

 

9  

Page 10: Resilience at exascale

Exascale Design Point

Systems                      2012 BG/Q Computer        2020-2024              Difference Today & 2019
System peak                  20 Pflop/s                1 Eflop/s              O(100)
Power                        8.6 MW                    ~20 MW
System memory                1.6 PB (16*96*1024)       32-64 PB               O(10)
Node performance             205 GF/s (16*1.6GHz*8)    1.2 or 15 TF/s         O(10)-O(100)
Node memory BW               42.6 GB/s                 2-4 TB/s               O(1000)
Node concurrency             64 threads                O(1k) or 10k           O(100)-O(1000)
Total node interconnect BW   20 GB/s                   200-400 GB/s           O(10)
System size (nodes)          98,304 (96*1024)          O(100,000) or O(1M)    O(100)-O(1000)
Total concurrency            5.97 M                    O(billion)             O(1,000)
MTTI                         4 days                    O(<1 day)              -O(10)

Both price and power envelopes may be too aggressive!

Page 11: Resilience at exascale

Time to checkpoint

•  Assume "RAID 5" across memories, checkpoint size ~50% of memory size
  –  Checkpoint time ≈ time to transfer 50% of memory to another node: a few seconds!
  –  Memory overhead ~50%
  –  Energy overhead small with NVRAM
  –  (But need many write cycles)
  –  We are in the comfort zone (re checkpoint)

11  

•  How about recovery?
•  Time to restart application same as time to checkpoint
•  Problem: time to recover if system failed
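A back-of-the-envelope version of the "few seconds" claim, using numbers assumed from the design-point table (64 PB system memory, 100,000 nodes, 200 GB/s node interconnect; illustrative, not measured):

```python
# Rough in-memory checkpoint-time estimate for a "RAID 5 across memories"
# scheme; all inputs are assumptions from the exascale design-point table.
system_memory = 64e15          # bytes (64 PB)
nodes = 100_000
node_bw = 200e9                # bytes/s transferable to a partner node

per_node_memory = system_memory / nodes        # 640 GB per node
checkpoint_bytes = 0.5 * per_node_memory       # checkpoint ~50% of memory
t_checkpoint = checkpoint_bytes / node_bw
print(f"checkpoint time ~ {t_checkpoint:.1f} s")
```

The result is on the order of a couple of seconds, which is what puts in-memory checkpointing inside the comfort zone.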

Page 12: Resilience at exascale

Key Assumptions:

1.  Errors are detected (before the erroneous checkpoint is committed)
2.  System failures are rare (<< 1 per day) or recovery from them is very fast (minutes)

12  

Page 13: Resilience at exascale

Hardware Will Fail More Frequently

•  More, smaller transistors
•  Near-threshold logic
☛ More frequent bit upsets due to radiation
☛ More frequent multiple-bit upsets due to radiation
☛ Larger manufacturing variation
☛ Faster aging

13  

Page 14: Resilience at exascale

Hardware Error Detection: Assumptions

14  

3.4.1 Compute Node Soft Errors and Failures

Soft errors and failures in the compute node (processor and memory only; network, power, and cooling are discussed later in this section) are a result of events that are entirely external to the system and cannot be replicated. Furthermore, soft faults are transient in nature and leave no lasting impact on hardware. By far the most significant source of soft faults is energetic particles that interact with the silicon substrate and either flip the state of a storage element or disrupt the operation of a combinational logic circuit. The two common sources of particle-strike faults are alpha particles that originate within the package and high-energy neutrons. When a high-energy neutron interacts with the silicon die, it creates a stream of secondary charged particles. These charged particles then further interact with the semiconductor material, freeing electron-hole pairs. If the charged particle creates the electron-hole pairs within the active region of a transistor, a current pulse is formed. This current pulse can directly change the state of a storage device or can manifest as a wrong value at the end of a combinational logic chain. Alpha particles are charged and may directly create electron-hole pairs.

To model the impact a particle strike has on a compute node, we model the effect on each node component separately, namely: SRAM, latches, combinational logic, DRAM, and NV-RAM. We then determine rough estimates for the number of units of each component within the node. We use this estimate to provide very rough, order-of-magnitude fault rates for the compute node. We also briefly mention how such faults are handled in processors today and discuss how advances in process technology are expected to affect these soft faults. We make projections for the impact of particle-strike soft errors on a future 11nm node, as well as present an estimate of the overhead/error-rate tradeoffs at the hardware level. The estimates are based on the models below and on some assumptions about the components of a node, as shown in Table 2.

A few important caveats about the models and projections:

• The numbers summarized in the table below do not include hard errors, including intermittent hard errors. We expect intermittent hard errors and failures in hardware to be a significant contributor to software-visible errors and failures.

• We do not have access to good models for the susceptibility of near-threshold circuits and do not consider such designs.

• We give only a rough, order-of-magnitude at best, estimate; many important factors remain unknown with respect to the 11nm technology node.

Table 2: Summary of assumptions on the components of a 45nm node and estimates of scaling to 11nm.

                                                  45nm                    11nm
Cores                                             8                       128
Scattered latches per core                        200,000                 200,000
Scattered latches in uncore (relative to cores)   1.25/√ncores = 0.44     1.25/√ncores = 0.11
FIT per latch                                     10^-1                   10^-1
Arrays per core (MB)                              1                       1
FIT per SRAM cell                                 10^-4                   10^-4
Logic FIT / latch FIT                             0.1-0.5                 0.1-0.5
DRAM FIT (per node)                               50                      50

Page 15: Resilience at exascale

Hardware Error Detection: Analysis

15  

Array interleaving and SECDED (baseline)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       20000       1       1000
Scattered latches      200       4000       N/A      N/A         20      400
Combinational logic    20        400        N/A      N/A         0       4
DRAM                   50        50         0.5      0.5         0.005   0.005
Total                  1000-5000 100000     10-100   5000-20000  10-50   500-5000

Array interleaving and >SECDED (11nm overhead: ~1% area and <5% power)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       1000        1       5
Scattered latches      200       4000       N/A      N/A         20      400
Combinational logic    20        400        N/A      N/A         0.2     5
DRAM                   50        50         0.5      0.5         0.005   0.005
Total                  1500-6500 100000     10-50    500-5000    10-50   100-500

Array interleaving and >SECDED + latch parity (45nm overhead ~10%; 11nm overhead: ~20% area and ~25% power)

                       DCE [FIT]            DUE [FIT]            UE [FIT]
                       45nm      11nm       45nm     11nm        45nm    11nm
Arrays                 5000      100000     50       1000        1       5
Scattered latches      200       4000       20       400         0.01    0.5
Combinational logic    20        400        N/A      N/A         0.2     5
DRAM                   0         0          0.1      0.0         0.100   0.001
Total                  1500-6500 100000     25-100   2000-10000  1       5-20

Table 3: Summary of per-processor particle-strike soft error characteristics within a compute node (sea level, equator). Note that other sources of transient faults cannot be ignored.

SRAM. Large SRAM arrays dominate the raw particle-strike fault rate of a processor silicon die. When a particle strike releases charge within an active region of a transistor in an SRAM cell, the charge collected may exceed the charge required to change the value stored in the cell, causing a single-event upset (SEU). An SEU may impact a single SRAM cell or may change the values of multiple adjacent cells. Such multi-cell upsets (MCUs) are also called burst errors. A reasonable ballpark number for the SRAM particle-strike upset rate is 1 upset every 10^7 hours for 1 Mb of capacity, which is a rate of 10^-4 FIT/bit. Our best estimates indicate that the SEU rate for SRAM will remain roughly constant as technology scales. While many complex phenomena impact susceptibility, the current roadmap of changes to devices, operating voltage, and scale does not point to extreme changes in susceptibility. What is expected to change is the distribution of MCUs, with a single upset more likely to affect longer bursts of cells at smaller scales.

Because the raw FIT/chip from SRAM is high (estimated at roughly 0.5 upsets per year per chip, or multiple upsets an hour in a large-scale HPC system), large arrays are protected with error detection and error correction capabilities. An approach in use today is a combination of physical word interleaving coupled with an error detection code (EDC) or with error checking and correcting (ECC) mechanisms. Given the distribution of MCUs today, 4-way interleaving with SECDED capabilities per array line is sufficient. Stronger capabilities will likely be needed in the future, but their energy and area overhead is expected to be low (see Table 3). Note that our estimates assume that 4-bit or longer bursts increase from 1% of all SEUs to 10% or higher between the 45nm and 11nm technology nodes, and that the rate of bursts of 8 bits or longer increases from 0.01% of all SEUs to 1% of all SEUs.

Note that alternative storage technology with much lower particle-strike error rates is possible. Some current processors use embedded DRAM for large arrays and there is a possibility that future processors will use on-chip arrays of non-volatile storage. Embedded DRAM has a 100 times or more lower error rate than SRAM. Non-volatile storage cells are immune to particle strikes, but do display some soft error fault


Page 16: Resilience at exascale

Summary of (Rough) Analysis

•  If no new technology is deployed, can have up to one undetected error per hour
•  With additional circuitry, could get down to one undetected error per 100-1,000 hours (week to months)
  –  The cost could be as high as 20% additional circuits and 25% additional power
  –  Main problems are in combinational logic and latches
  –  The price could be significantly higher (small market for high-availability servers)
•  Need software error detection as an option!
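These rates follow from the per-node FIT numbers in Table 3 (FIT = failures in 10^9 device-hours); a quick conversion, assuming a 100,000-node machine (the node count is an assumption from the design-point table):

```python
def undetected_per_hour(fit_per_node, nodes):
    """Convert a per-node FIT rate (failures per 10^9 device-hours)
    into a system-wide undetected-errors-per-hour rate."""
    return fit_per_node * nodes / 1e9

nodes = 100_000  # assumed exascale node count
# Baseline 11nm undetected-error (UE) FIT per node is 500-5000 (Table 3);
# with the extra circuitry (latch parity) it drops to 5-20.
for fit in (5000, 20):
    rate = undetected_per_hour(fit, nodes)
    print(f"{fit} FIT/node -> one undetected error every {1/rate:.0f} h")
```

The upper baseline figure gives roughly one undetected error every couple of hours; the hardened figure stretches that to hundreds of hours, matching the orders of magnitude on the slide.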

16  

Page 17: Resilience at exascale

Application-Level Data Error Detection

•  Check checkpoint for correctness before it is committed.
  –  Look for outliers (assume smooth fields) -- handles high-order bit errors
  –  Use robust solvers -- handles low-order bit errors
    •  May not work for discrete problems, particle simulations, discontinuous phenomena, etc.
    •  May not work if outlier is rapidly smoothed (error propagation)
  –  Check for global invariants (e.g., energy preservation)
•  Ignore error
•  Duplicate computation of critical variables
•  Can we reduce checkpoint sizes (or avoid them altogether)?
  –  Tradeoff: how much is saved vs. how often data is checked (big/small transactions)
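The outlier check on smooth fields can be sketched as a neighbor-deviation test. This is an illustrative sketch, not the method from the talk; the function name and the tolerance `tol` are assumed, problem-dependent choices:

```python
def find_outliers(field, tol=1e3):
    """Flag points whose value deviates wildly from the average of their
    neighbors: a cheap smoothness check that catches high-order bit flips.
    `tol` is a relative-deviation threshold (problem-dependent)."""
    bad = []
    for i in range(1, len(field) - 1):
        neighbor_avg = 0.5 * (field[i - 1] + field[i + 1])
        if abs(field[i] - neighbor_avg) > tol * max(1.0, abs(neighbor_avg)):
            bad.append(i)
    return bad
```

A high-order bit flip turns a value into a wild outlier that this test catches; a low-order flip changes the value only slightly and must instead be absorbed by a robust solver, as the slide notes.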

 17  

Page 18: Resilience at exascale

How About Control Errors?

•  Code is corrupted, jump to wrong address, etc.
  –  Rarer (more data state than control state)
  –  More likely to cause a fatal exception
  –  Easier to protect against

 

18  

Page 19: Resilience at exascale

How About System Failures (Due to Hardware)?

•  Kernel data corrupted
  –  Page tables, routing tables, file system metadata, etc.
•  Hard to understand and diagnose
  –  Breaks abstraction layers
•  Makes sense to avoid such failures
  –  Duplicate system computations that affect kernel state (price can be tolerated for HPC)
  –  Use transactional methods
  –  Harden kernel data
    •  Highly reliable DRAM
    •  Remotely accessible NVRAM
  –  Predict and avoid failures

19  

Page 20: Resilience at exascale

Failure Prediction from Event Logs

20  

types of signals: periodic, noise, and silent. Figure 1 presents the three types and the possible cause for each type.

We observed that a fault trigger in the system does not have a consistent representation in the logs. For example, a memory failure will cause the faulty module to generate a large number of messages. Conversely, in case of a node crash the error will be characterized by a lack of notifications. Data mining algorithms in general assume that faults manifest themselves in the same way and consequently fail to handle more than one type of behavior.

For example, even though silent signals represent the majority of event types, data mining algorithms fail to extract the correlation between them and other types of signals. This affects fault prediction in both the total number of faults seen by the method and the time delay offered between the prediction and the actual occurrence of the fault.

Signal analysis methods can handle all three signal types, and thus provide a larger set of correlations that can be used for prediction. However, data mining algorithms are better suited to characterizing correlations between different high-dimensionality sets than the cross-correlation function offered by signal analysis. Data mining is a powerful technology that converts raw data into an understandable and actionable form, which can then be used to predict future trends or provide meaning to historical events.

Additionally, outlier detection has a rich research history incorporating both statistical and data mining methods for different types of datasets. Moreover, these methods are able to implicitly adapt to changes in the dataset and to apply threshold-based distance measures separating outliers from the bulk of good observations. In this paper, we combine the advantages of both methods in order to offer a hybrid approach capable of characterizing different behaviors given by events generated by an HPC system and providing an adaptive forecasting method by using the latest data mining techniques.

In the following sections we present the methodology used for preprocessing the log files and extracting the signals, and then we introduce the novel hybrid method that combines signal analysis concepts with data mining techniques for outlier detection and correlation extraction. An overview of the methodology is presented in Figure 2.

A. Preprocessing

Log files generated by large HPC systems contain millions of message lines, making their manual analysis impossible. Moreover, the format in which messages are generated is not structured and differs between different systems and sometimes even between different components of the same machine. In order to bring structure to our analysis, we extract the description of all generated messages. These descriptions represent the characterization of different events used by the system. Also, as software changes with versions, bug fixes, or driver updates, these descriptions are modified to reflect the system's output at all times.

Fig. 2. Methodology overview of the hybrid approach

For this, we performed an initial pass over the log files to identify frequently occurring messages with similar syntactic patterns. Specifically, we use the Hierarchical Event Log Organizer [15] on the raw logs, resulting in a list of message templates. These templates represent regular expressions that describe a set of syntactically related messages and define different events in a system. In the on-line phase, we use HELO on-line to keep the set of templates updated and relevant to the output of the system.

For the rest of the paper, we analyze the generated events separately by extracting a signal for each of them and characterizing their behavior and the correlations between them. Figure 1 presents one template or event type for each type of signal. First, we extract the signal for each event type by sampling the number of event occurrences for every time unit, and afterwards we use wavelets and filtering to characterize the normal behavior of each of them. In our experiments, we use a sampling rate of 10 seconds for all signals. More details about this step can be found in [4].

In the on-line phase, the signal creation module simply concatenates the existing signals with the information received from the input stream of events. For optimization purposes, we only keep the last two months in the on-line module, since execution time is an important factor in this phase. The outlier monitor and the prediction system are applied on this trimmed and updated set of signals.

B. Analysis Modules

1) Outlier detection: All analysis modules are novel hybrid modules that combine data mining techniques with the previously extracted set of signals and their characterization. Since the offline phase is not run in real time and the execution time is not constrained, we did not optimize this step. For outlier detection in the on-line phase, we use as input the adapted set of signals and apply a simple data-cleaning method for identifying the erroneous data points.

We implement this step as a filtering signal analysis module so that it can be easily inserted between signal analysis

Gainaru, Cappello, Snir, Kramer (SC12)

Use a combination of signal analysis (to identify outliers) and data mining (to find correlations)
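The preprocessing step described above (one count signal per message template, sampled every 10 seconds) can be sketched as follows; the function name and input format are illustrative assumptions, and only the 10-second sampling period comes from the paper:

```python
from collections import Counter

def event_signal(timestamps, sample_period=10.0):
    """Turn the timestamps (in seconds) of one message template into a
    count-per-bin signal, one bin per sampling period, as in the paper's
    preprocessing step."""
    if not timestamps:
        return []
    bins = Counter(int(t // sample_period) for t in timestamps)
    return [bins.get(i, 0) for i in range(max(bins) + 1)]
```

Wavelets and filtering would then be applied to such signals to characterize each event type's normal behavior.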

Page 21: Resilience at exascale

Can Hardware Failure Be Predicted?

21  

Fig. 7. Percentage of sequences propagating on different racks, midplanes and nodes

Fig. 8. Prediction time window

propagate to multiple locations. For example the sequence:
  can not get assembly information for node card
  link card power module * is not accessible
  no power module * found on link card
gives information about a node card that is not fully functional. Events marked as "severe" and "failure" occur after around one hour and report that the link card module is not accessible from the same midplane and that the link card is not found. The sequence is generated by the same node for all its occurrences in the log.

For 75% of correlations that do not propagate, the prediction system does not need to worry about finding the right location that will be affected by the failure. However, for the other 25% that propagate, a wrong prediction will lead to a decrease in both precision and recall. We analyzed this a little further and observed that for most propagation sequences the initiating node (the one where the first symptom occurs) is included in the set of nodes affected by the failure. This leads us to believe that the recall of the prediction system will be more affected by the location predictor than its precision.

VI. DISSECTING PREDICTION

Figure 8 shows an overview of the prediction process. The observation window is used for the outlier detection. The analysis time represents the overhead of our method in making a prediction: the execution time for detecting the outlier, triggering a correlation sequence, and finding the corresponding locations. The prediction window is the time delay until the predicted event will occur in the system. The prediction window starts right after the observation point but is visible only at the end of the analysis time.

In the next section we analyze the prediction based on the visible prediction window and then propose an analytical model for the impact of our results on checkpointing strategies.

Prediction method   Precision   Recall   Seq used      Pred. failures
ELSA hybrid         91.2%       45.8%    62 (96.8%)    603
ELSA signal         88.1%       40.5%    117 (92.8%)   534
Data mining         91.9%       15.7%    39 (95.1%)    207

TABLE II. PERCENTAGE WASTE IMPROVEMENT IN CHECKPOINTING STRATEGIES

The metrics used for evaluating prediction performance are precision and recall:

• Precision is the fraction of failure predictions that turn out to be correct.

• Recall is the fraction of failures that are predicted.

A. Analysis

In the on-line phase the analysis is composed of the outlier detection and the module that triggers the predictions after inspecting the correlation chains. We computed the execution time for different regimes: during the normal execution of the system and during the periods that put the most stress on the analysis, specifically periods with bursts of messages. If the incoming event type is already in an active correlation list, we do not investigate it further since it will not give us additional information.

The systems we analyzed generate on average 5 messages per second, and during bursts of messages the logs present around 100 messages per second. The analysis window is negligible in the first case and around 2.5 seconds in the second. The worst case seen for these systems was 8.43 seconds, during an NFS failure on Mercury. By taking this analysis window into consideration we examined how many correlation chains are actually used for predicting failures and which failures we are able to detect before they occur.

Our previous work showed 43% recall and 93% precision for the LANL system by using a purely signal analysis approach. However, at that point, we were not interested in predicting the location where the fault might occur. In this paper, we focus on both the location and the prediction window. We compute the results only for the BlueGene/L systems and guided our metrics based on the severity field offered by the system.

We analyzed the number of sequences found with our initial signal analysis approach, the data mining algorithm described in [29], and the present hybrid method. Signal analysis gives a larger number of sequences, in general of small length, making the analysis window higher. Also, the on-line outlier detection puts extra stress on the analysis, making the analysis window exceed 30 seconds when the system experiences bursts. Due to our data mining extraction of multi-event correlations we were able to keep only the most frequent subset, making the on-line analysis work on a much lighter correlation set. At the other extreme, the data mining approach loses correlations between signals of different types, so even if the correlation set is much smaller than our hybrid method's, the false negative count is higher.

For example, if 25% of errors are predicted, the new mttf is 4·mttf/3. The rest of the failures are predicted events and have a mean time between them of mttf/N seconds.

By applying the new mttf for the un-predicted failures to equation (2), the new optimal checkpoint interval becomes

    T_optimum = sqrt(2C · mttf / (1 − N))    (3)

The first two terms from equation (1) need to change to consider only the un-predicted failures, since for all the others preventive actions will be taken. By adding the first two terms and incorporating the value for the checkpoint interval from equation (3), the minimum waste becomes:

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf    (4)

The last term from equation (1) will not change, since for all failures, both predicted and un-predicted, the application needs to be restarted. In addition to the waste from (4), each time an error is predicted the application will take a checkpoint, and it will waste the execution time between when this checkpoint is taken and the occurrence of the failure. This value depends on the system the application is running on and can range from a few seconds to even one hour. However, for the systems we analyzed the time delay is in general very low, and for our model we consider it negligible compared to the checkpointing time. We add the waste of C seconds for each predicted failure, which happens every mttf/N seconds. After adding this waste, equation (4) becomes:

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf + CN/mttf    (5)

In the ideal case, when N=1, the minimum waste is equal to the time to checkpoint right before every failure and the time to restart after every failure. The formula assumes perfect precision. In case the precision is P, the waste value must also take into consideration the cases when the prediction is wrong. The predicted faults happen every mttf/N seconds and they represent P of the total predictions. This means that the remaining (1 − P) false-positive predictions will happen every (P/(1 − P)) · (mttf/N) seconds. Each time a false positive is predicted, a checkpoint is taken that must be added to the total waste from equation (5):

    W_min^recall = sqrt(2C(1 − N) / mttf) + (R + D)/mttf + CN/mttf + CN(1 − P)/(P · mttf)    (6)

As an example, we consider the values used by [34] to characterize current systems: R = 5, D = 1, in minutes, and study two values for the time to checkpoint: C = 1 minute and, from [25], C = 10 seconds. We computed the gain from using the prediction offered by our hybrid method with different precision and recall values and for different MTTFs. Table III

C       Precision   Recall   MTTF for the whole system   Waste gain
1 min   92          20       one day                     9.13%
1 min   92          36       one day                     17.33%
10 s    92          36       one day                     12.09%
10 s    92          45       one day                     15.63%
1 min   92          50       5 h                         21.74%
10 s    92          65       5 h                         24.78%

TABLE III. PERCENTAGE WASTE IMPROVEMENT IN CHECKPOINTING STRATEGIES

presents the results. The first four cases present numbers from real systems and checkpointing strategies. Interestingly, for future systems with an MTTF of 5h, if the prediction can provide a recall over 50% then the wasted time decreases by more than 20%. In the future, we plan to combine a checkpointing strategy with our prediction and study its effectiveness in real HPC systems.
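The waste model can be checked numerically. A sketch, assuming equation (1) is the standard no-prediction waste sqrt(2C/mttf) + (R+D)/mttf (the function names are ours); it reproduces the table's numbers:

```python
import math

def waste(C, R, D, mttf, N=0.0, P=1.0):
    """Minimum waste fraction from equation (6); N = recall, P = precision.
    With N = 0 and P = 1 this reduces to the no-prediction waste."""
    return (math.sqrt(2 * C * (1 - N) / mttf)   # checkpoints for unpredicted failures
            + (R + D) / mttf                    # restart + downtime for every failure
            + C * N / mttf                      # checkpoint before each predicted failure
            + C * N * (1 - P) / (P * mttf))     # checkpoints for false positives

def waste_gain(C, R, D, mttf, N, P):
    """Percentage improvement over the no-prediction baseline."""
    w0 = waste(C, R, D, mttf)
    return 100 * (w0 - waste(C, R, D, mttf, N, P)) / w0

# First row of Table III: C = 1 min, MTTF = one day, P = 92%, N = 20%
print(round(waste_gain(1, 5, 1, 1440, 0.20, 0.92), 2))  # -> 9.13, as in the table
```

The same call with mttf = 300 (5 hours) and N = 0.50 gives the table's 21.74%, so the displayed equations are internally consistent with the reported gains.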

VII. CONCLUSION

This paper investigates a novel way of analyzing log files from large-scale systems, by combining two different analysis techniques, data mining and signal processing, and using the advantages given by both. We use signal analysis concepts for shaping the normal behaviour of each event type and of the whole system and characterizing the way faults affect them. This way the models we use are more realistic, in that they take into account the different behaviour of the events during failure-free execution and when failures occur. At the same time, we use data mining algorithms for analyzing the correlations between these behaviors, since these algorithms prove themselves better suited to characterizing the interactions between different high-dimensionality sets than the cross-correlation function offered by signal analysis.

In our experiments we show that a more realistic model, like the one obtained with the hybrid method, influences the prediction results and, in the end, improves the efficacy of fault tolerance algorithms. We investigated the lag time between the prediction moment and the time of occurrence of the actual failure, taking into consideration the analysis time, and concluded that the proposed model could allow proactive actions to be taken. Moreover, since the location of an error is an important part of a prediction system, we included location analysis in our prediction and studied its impact on the results. We will focus in the future on a more detailed analysis of different error types for which our system has a low recall. Also, we plan to study, to a wider extent, the practical way the prediction system influences current fault tolerance mechanisms.

ACKNOWLEDGMENT

This work was supported in part by the DoE 9J-30281-0008A grant, and by the INRIA-Illinois Joint Laboratory for Petascale Computing.

Migrating processes when node failure is predicted can significantly improve utilization

Page 22: Resilience at exascale

How About Software Bugs?

•  Parallel code (transformational, mostly deterministic) vs. concurrent code (event-driven, very nondeterministic)
  –  Hard to understand concurrent code (cannot comprehend more than 10 concurrent actors)
  –  Hard to avoid performance bugs (e.g., overcommitted resources causing time-outs)
  –  Hard to test for performance bugs
•  Concurrency performance bugs (e.g., in parallel file systems) are a major source of failures on current supercomputers
  –  Problem will worsen as we scale up -- performance bugs become more frequent
•  Need to become better at avoiding performance bugs (learn from control theory and real-time systems)
  –  Make system code more deterministic

22  

Page 23: Resilience at exascale

System Recovery

•  Local failures (e.g., node kernel crashed) are not a major problem -- can replace
•  Global failures (e.g., global file system crashed) are the hardest problem -- need to avoid
•  Issue: fault containment
  –  Local hardware failure corrupts global state
  –  Localized hardware error causes global performance bugs

23  

Page 24: Resilience at exascale

Quis Custodiet Ipsos Custodes?

•  Who  watches  the  watchmen?  

•  Need robust (scalable, fault-tolerant) infrastructure for error reporting and recovery orchestration
  –  Current approach of out-of-band monitoring and control is too restricted

24  

Page 25: Resilience at exascale

Summary

•  Resilience is a major problem in the exascale era
•  Major to-dos:
  –  Need to understand much better the failures in current systems and future trends
  –  Need to develop a good application error detection methodology; error correction is desirable, but less critical
  –  Need to significantly reduce system errors and reduce system recovery time
  –  Need to develop robust infrastructure for error handling

25  

Page 26: Resilience at exascale

26  

The End