
Justin W. Eggstaff, Thomas A. Mazzuchi, Shahram Sarkani

J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The development of progress plans using a performance-based expert judgment model to assess technical performance and risk”. Systems Engineering, Volume 16, Number 2, 2014.

J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The effect of the number of seed variables on the performance of Cooke’s Classical Model”. Reliability Engineering & System Safety, 2nd revision.

• Each year, DoD research & development (R&D) costs run approximately 50% above original estimates

• Typical delays in weapons systems initial operational capability (IOC) exceed 20 months

• Weapons Systems Acquisition Reform Act of 2009


• Overall program performance depends on three factors: cost, schedule, and technical performance

• Technical performance is typically “assumed”

• Poor cost and schedule performance are symptoms or effects of poor technical performance

• Current methods for the prediction of


• Technical Measurement
  • Types of technical measures
  • Attributes
  • Technical reviews and audits


• Designed to provide a numerical value of risk by comparing current technical performance measure (TPM) progress against a desired level of performance, or a performance threshold, predefined by the analyst

• Category A: Smaller the Better (Software Errors)

• Category B: Larger the Better (System Range)

• Overall Risk

•  Probably the most widely used method for combining expert judgment in a variety of applications

• Uses a set of seed variables to calculate individual expert Calibration and Information scores which, in turn, are used to calculate an expert’s relative weight

•  The experts’ predicted values for a target variable are combined using their individual weights to calculate the decision maker’s assessment of that variable

• Experts express their uncertainty by specifying 5%, 50%, and 95% quantiles for a set of seed variables (whose actual realizations are known to the analyst alone) and for a set of variables of interest

• The analyst determines the Intrinsic Range, or bounds, for the variable distributions

• By specifying the 5%, 50%, and 95% quantiles, the expert is specifying a 4-bin multinomial distribution with probabilities p = (.05, .45, .45, .05) for each seed variable response

• Let s_i denote the observed frequency with which seed variable realizations fall in bin i

• We may test how well the expert is calibrated by testing the hypothesis H0: s_i = p_i for all i vs. Ha: s_i ≠ p_i for some i
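The binning step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `bin_frequencies`, the input layout, and the tie convention (a realization exactly equal to a quantile counts in the lower bin) are assumptions.

```python
from bisect import bisect_left

# Theoretical bin probabilities implied by the 5%, 50%, 95% quantiles.
P = (0.05, 0.45, 0.45, 0.05)

def bin_frequencies(assessments, realizations):
    """Return the observed bin frequencies s = (s_1, ..., s_4) for one expert.

    assessments  -- list of (q05, q50, q95) tuples, one per seed variable
    realizations -- list of the true seed variable values, in the same order
    (hypothetical input layout, chosen for illustration)
    """
    counts = [0, 0, 0, 0]
    for quantiles, x in zip(assessments, realizations):
        # bisect_left gives the number of quantiles strictly below x (0..3),
        # which is the index of the inter-quantile bin containing x.
        counts[bisect_left(quantiles, x)] += 1
    n = len(realizations)
    return [c / n for c in counts]

# Four seed variables, all assessed as (1, 5, 9); one realization per bin.
s = bin_frequencies([(1, 5, 9)] * 4, [0, 3, 6, 10])
print(s)  # [0.25, 0.25, 0.25, 0.25]
```

A well-calibrated expert would show frequencies close to P over many seed variables; the example expert above places far too many realizations in the two tail bins.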

• Test Statistic

• If N (the number of seed variables) is large enough

• Thus the calibration score for the expert is the probability, under H0, of obtaining a relative information statistic at least as large as the one actually observed
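As a hedged sketch of the calibration score, the code below uses the formulation commonly given for Cooke's Classical Model (Cooke, 1991): the relative information I(s, p) = Σ s_i ln(s_i / p_i), the statistic 2·N·I(s, p), which is approximately chi-square with 3 degrees of freedom for large N, and the calibration score C(e) = 1 − χ²₃(2·N·I(s, p)). These formulas are assumed from that standard formulation rather than taken from the slide, and the function names are illustrative.

```python
import math

P = (0.05, 0.45, 0.45, 0.05)

def relative_information(s, p=P):
    """I(s, p) = sum_i s_i * ln(s_i / p_i); empty bins contribute zero."""
    return sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)

def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def calibration_score(s, n_seeds):
    """Probability, under H0, of a statistic at least as large as the one observed."""
    return chi2_sf_3dof(2 * n_seeds * relative_information(s))

# A perfectly calibrated expert scores 1; a uniform spread over bins scores far lower.
print(calibration_score(P, 20))
print(calibration_score((0.25, 0.25, 0.25, 0.25), 20))
```

The closed-form tail probability avoids a dependency on a statistics library; it holds only for 3 degrees of freedom, which matches the 4-bin case described here.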

 The relative information for expert e on a variable is
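One common form of this score, sketched below under stated assumptions, measures how concentrated the expert's quantiles are relative to a uniform background distribution on the intrinsic range [L, U]: I = Σ p_i ln(p_i / r_i), where r_i is the fraction of the intrinsic range covered by bin i. The uniform background, the 10% overshoot used to build the intrinsic range, and all function names are assumptions for illustration, not details recovered from the slide.

```python
import math

P = (0.05, 0.45, 0.45, 0.05)

def intrinsic_range(all_quantiles, realization=None, overshoot=0.10):
    """Bounds covering every expert's quantiles (and the realization, if known),
    widened on each side by a fraction of the span (assumed 10% here)."""
    values = [q for quantiles in all_quantiles for q in quantiles]
    if realization is not None:
        values.append(realization)
    lo, hi = min(values), max(values)
    pad = overshoot * (hi - lo)
    return lo - pad, hi + pad

def information_score(quantiles, lower, upper, p=P):
    """Relative information of the expert's 4-bin distribution with respect to
    a uniform background on [lower, upper]."""
    edges = [lower, *quantiles, upper]
    widths = [b - a for a, b in zip(edges, edges[1:])]
    span = upper - lower
    return sum(pi * math.log(pi / (w / span)) for pi, w in zip(p, widths))

lower, upper = intrinsic_range([(10, 20, 30), (15, 25, 40)])
print(lower, upper)
print(information_score((10, 20, 30), lower, upper))
```

Narrow quantiles relative to the intrinsic range give a high score; quantiles that match the uniform background give a score of zero.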

• Total weight for the expert is the normalized product of the calibration and information scores

• The weights are optimized by choosing a minimum calibration value α such that if C(e) < α, the expert receives zero weight

• α is selected so that a fictitious expert with a distribution equal to that of the weighted combination of expert distributions would be given the highest weight among experts

• Final uncertainty distribution: F_DM(x) = Σ_i w_i F_i(x)
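The weighting and pooling steps above can be sketched as follows. This is a minimal illustration with hypothetical function names; it takes the cutoff α as given and does not implement the search for the optimal α.

```python
def performance_weights(calibration, information, alpha=0.0):
    """Normalized product of calibration and information scores.

    Experts whose calibration score falls below the cutoff alpha get zero weight.
    """
    raw = [c * i if c >= alpha else 0.0 for c, i in zip(calibration, information)]
    total = sum(raw)
    if total == 0:
        raise ValueError("no expert passes the calibration cutoff")
    return [r / total for r in raw]

def pooled_cdf(weights, cdfs, x):
    """Decision maker's distribution: weighted linear pool sum_i w_i * F_i(x)."""
    return sum(w * F(x) for w, F in zip(weights, cdfs))

# Three experts; the second is cut off by alpha = 0.05.
w = performance_weights([0.5, 0.01, 0.2], [1.0, 2.0, 1.5], alpha=0.05)
print(w)  # approximately [0.625, 0.0, 0.375]

# Toy CDFs on [0, 1], used only to show the pooling call.
cdfs = [lambda x: x, lambda x: x ** 2, lambda x: x ** 0.5]
print(pooled_cdf(w, cdfs, 0.25))
```

An equally weighted decision maker corresponds to passing weights of 1/n for each of the n experts to `pooled_cdf`.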

• Three reasons for an iterative cross-validation analysis
  • The Classical Model uses a set of seed variables to develop expert weights; an iterative approach is needed
  • The question of the minimum number of seed variables required has not been answered
  • The ongoing debate over the robustness of the Classical Model (performance weights versus equal weights)

• Cooke and Goossens (2008)
  • Examines 45 expert judgment studies compiled over 20 years
• Clemen (2008)
  • Asserts that “in-sample” analysis is biased toward the Classical Model; suggests the use of “out-of-sample/Remove-One-At-a-Time (ROAT)” analysis
  • Selected 14 studies to compare the performance-weighted (PW) decision maker and the equally-weighted (EW) decision maker
• Cooke (2008)
  • Notes that a ROAT approach tends to favor or punish excluded experts and presents a “two-fold” cross validation
  • In 20 of 26 validation runs, the PW outperformed the EW
• Lin and Cheng (2008; 2009)
  • Using out-of-sample analysis, examines the available 45 studies and finds that the PW outperforms the EW, but with degraded performance
• Flandoli et al. (2010)
  • Performs a modified “two-fold” cross validation with 500 combinations of 30-70 splits
  • Results show that Cooke’s model gives the best indication of uncertainty when averaged

• Analysis conducted
  • Comprehensive “Out-of-Sample” analysis
  • One-tailed sign test (Clemen, 2008)
• Data used
  • 55 expert judgment studies compiled over 20 years
  • 63 data sets: 604 experts, 770 seed variables, ~68M judgments


Iteration   Seed Variables Used   Target Variables Evaluated
1           1                     2, 3, 4
2           2                     1, 3, 4
3           3                     1, 2, 4
4           4                     1, 2, 3
5           1, 2                  3, 4
6           1, 3                  2, 4
7           1, 4                  2, 3
8           2, 3                  1, 4
9           2, 4                  1, 3
10          3, 4                  1, 2
11          1, 2, 3               4
12          1, 2, 4               3
13          1, 3, 4               2
14          2, 3, 4               1
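The iteration scheme in the table, where every non-empty proper subset of the variables serves once as the seed set and the remainder as targets, can be enumerated with a short sketch. The function name `seed_target_splits` is illustrative, not from the source.

```python
from itertools import combinations

def seed_target_splits(variables):
    """List every (seed set, target set) split, ordered by seed-set size.

    Each split uses k variables as seeds (1 <= k < n) and evaluates the
    decision maker on the remaining n - k variables.
    """
    variables = list(variables)
    splits = []
    for k in range(1, len(variables)):
        for seeds in combinations(variables, k):
            targets = [v for v in variables if v not in seeds]
            splits.append((list(seeds), targets))
    return splits

splits = seed_target_splits([1, 2, 3, 4])
for iteration, (seeds, targets) in enumerate(splits, start=1):
    print(iteration, seeds, targets)
# 14 iterations in total: 4 with one seed, 6 with two seeds, 4 with three seeds.
```

For n variables this yields 2^n − 2 splits, so the number of iterations grows quickly with the number of seed variables in a study.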


Extent of previous cross-validation studies

Mean Out-of-Sample Combination Scores (Calibration × Information)


Columns 1 to 8: number of variables used to determine the performance measure

Study ID   No. of Experts   No. of Variables   DM Type   1        2        3        4        5        6        7        8
MVOSEEDS   77               5                  PWDM      0.3259   0.5579   0.6773   0.8414
                                               EWDM      0.0279   0.1154   0.3071   0.6963
A_SEED     7                6                  PWDM      0.1434   0.3312   0.3462   0.3332   0.4439
                                               EWDM      0.0072   0.0229   0.0580   0.1260   0.2508
AOTDAILY   7                6                  PWDM      0.0167   0.0294   0.0583   0.1199   0.2271
                                               EWDM      0.0164   0.0313   0.0586   0.1036   0.1565
FCEP       5                8                  PWDM      0.0028   0.5309   0.7328   0.8917   1.0556   1.0792   1.1396
                                               EWDM      0.0001   0.0008   0.0038   0.0135   0.0399   0.1059   0.2434
BSWAAL     6                8                  PWDM      0.3811   0.2538   0.3624   0.3958   0.3932   0.3665   0.4860
                                               EWDM      0.2697   0.3142   0.3458   0.3688   0.3862   0.3900   0.4406
DSM-1      10               8                  PWDM      0.1546   0.2075   0.2448   0.3224   0.4849   0.6048   0.6591
                                               EWDM      0.2637   0.2939   0.3105   0.3241   0.3403   0.3576   0.4508
MONT1      11               8                  PWDM      0.6249   0.6168   0.5673   0.5964   0.6656   0.6350   0.6423
                                               EWDM      0.2312   0.2880   0.3497   0.4158   0.4854   0.5734   0.7321
SO3EXPTS   4                9                  PWDM      0.0123   0.1847   0.3236   0.5801   0.7460   0.9834   1.0993   2.1950
                                               EWDM      2.9E-5   0.0002   0.0013   0.0063   0.0254   0.0856   0.2407   0.5700
WATERPOL   11               9                  PWDM      0.0115   0.1661   0.4032   0.5544   0.6987   0.8737   0.9985   1.0289
                                               EWDM      0.0033   0.0111   0.0313   0.0687   0.1195   0.1798   0.2624   0.4852

Single Decision Maker Dominates in 28 of 63 Cases
  • PWDM: 21 Cases
  • EWDM: 7 Cases

Single Modal Switching in 22 of 63 Cases
  • EWDM gives way to PWDM: 10 Cases
  • PWDM gives way to EWDM: 12 Cases

Dual Modal Switching (Parabolic) in 11 of 63 Cases
  • PWDM at the extremes: 7 Cases
  • EWDM at the extremes: 4 Cases

Somewhat Random Switching in 2 of 63 Cases
  • BSWAAL
  • ACNEXPTS


Accuracy Measures (One-tailed Sign Test)


Median p-value by data set
  • PWDM is more accurate than EWDM (p ≥ 0.5): 42 of 63 cases
  • EWDM is more accurate than PWDM (p < 0.5): 11 of 63 cases
  • Overall median p-value: 0.74

Median p-value by number of seed variables
  • PWDM is more accurate in ALL cases
  • PWDM is significantly more accurate in all but two cases
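For readers unfamiliar with the sign test, the sketch below shows its mechanics on paired PW and EW scores. It is a generic textbook one-tailed sign test with a hypothetical function name; the deck does not state exactly how its p-values are defined (the "p ≥ 0.5" convention above suggests a different tail or scaling), so this should not be read as the authors' procedure.

```python
from math import comb

def sign_test_upper_tail(pw_scores, ew_scores):
    """One-tailed sign test on paired scores.

    Returns P(X >= wins) for X ~ Binomial(n, 0.5), where wins counts the pairs
    in which the PW score exceeds the EW score and ties are discarded.
    A small value is evidence that PW tends to score higher than EW.
    """
    wins = sum(pw > ew for pw, ew in zip(pw_scores, ew_scores))
    losses = sum(pw < ew for pw, ew in zip(pw_scores, ew_scores))
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Mean combination scores for the A_SEED study, taken from the table above
# and used here only to illustrate the call.
pw = [0.1434, 0.3312, 0.3462, 0.3332, 0.4439]
ew = [0.0072, 0.0229, 0.0580, 0.1260, 0.2508]
print(sign_test_upper_tail(pw, ew))  # 0.03125 (5 wins out of 5 pairs)
```

The sign test uses only the direction of each paired difference, not its size, which makes it robust to the very different score scales seen across studies.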

• Data Used
  • The data set for this research comes from the unpublished white paper by Coleman, Kulick, and Pisano (1996) on the T45TS Cockpit-21 project
  • Actual data used, simulated for use in the expert judgment model

Appendix B

Figure 4-3, p. 112: E-TRI Flow Diagram

• Oct-93 Milestone Review: Nov-93 TPM value is predicted
• Nov-93 Milestone Review: TPM value is realized
• Expert weights are determined
• Decision maker’s assessment is calculated
  • Using weighted expert predictions
  • Calculated for all remaining milestones
• Nov-93 updated prediction is presented
• E-TRI for final state (Feb-94) is calculated
• Case Study Data
• System E-TRI for final state (Feb-94) is calculated

QUESTIONS?

• Garvey, P. R., & Cho, C.-C. (2003). An Index to Measure a System's Performance Risk. Acquisition Review Quarterly, Spring, 189-199.

• Winkler, R. L. (1968). The consensus of subjective probability distributions. Management Science, 15(2), 61-75.

• Cooke, R. M. (1991). Experts in uncertainty: opinion and subjective probability in science. New York: Oxford University Press.

• Clemen, R. T. (2008). Comment on Cooke's classical method. Reliability Engineering & System Safety, 93(5), 760-765.

• Coleman, C., Kulick, K., & Pisano, N. (1996). Technical performance measurement (TPM) retrospective implementation and concept validation on the T45TS Cockpit-21 program. Program Executive Office for Air Anti-Submarine Warfare, Assault, and Special Mission Programs, White Paper.
