Justin W. Eggstaff Thomas A. Mazzuchi
Shahram Sarkani
J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The development of progress plans using a performance-based expert judgment model to assess technical performance and risk”. Systems Engineering Volume 16 Number 2 in 2014.
J. W. Eggstaff, T. A. Mazzuchi, and S. Sarkani. “The effect of the number of seed variables on the performance of Cooke’s Classical Model”, Reliability Engineering and Systems Safety. – 2nd Revision
Each year, annual costs of DoD research & development (R&D) are approximately 50% above original estimates
Typical delays in weapons systems initial operational capability (IOC) are in excess of 20 months
Weapons Systems Acquisition Reform Act of 2009
Overall program performance depends on three factors: • Cost • Schedule • Technical
Technical performance is typically “assumed”
Poor cost and schedule performance are symptoms or effects that manifest from poor technical performance
Current methods for the prediction of
Technical Measurement • Types of technical measures • Attributes • Technical reviews and audits
Designed to provide a numerical value of risk by comparing current TPM progress against a desired level of performance, or a performance threshold, predefined by the analyst
Category A – Smaller the Better (Software Errors)
Category B – Larger the Better (System Range)
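The comparison of a TPM value against a desired level and a threshold can be sketched as a simple normalized score. This is a minimal illustration only: the function name `tpm_risk`, the linear scaling, and the example numbers are assumptions and are not the index defined in the talk. Because the goal lies below the threshold for Category A and above it for Category B, the same expression covers both categories.

```python
def tpm_risk(current, goal, threshold):
    """Hypothetical linear risk score in [0, 1].

    0.0 means the TPM is at (or better than) the desired level,
    1.0 means it is at (or worse than) the performance threshold.
    Category A (smaller the better): goal < threshold.
    Category B (larger the better):  goal > threshold.
    """
    if goal == threshold:
        raise ValueError("goal and threshold must differ")
    score = (current - goal) / (threshold - goal)
    # Clamp so that over-achievement is not rewarded with negative risk.
    return min(1.0, max(0.0, score))


# Category A example (assumed numbers): 10 software errors, goal 0, threshold 50.
risk_a = tpm_risk(10, goal=0, threshold=50)
# Category B example (assumed numbers): range 400, goal 500, threshold 300.
risk_b = tpm_risk(400, goal=500, threshold=300)
```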
Overall Risk
• Probably the most widely used method for combining expert judgment in a variety of applications
• Uses a set of seed variables to calculate individual expert Calibration and Information scores which, in turn, are used to calculate an expert’s relative weight
• The experts’ predicted values for a target variable are combined using their individual weights to calculate the decision maker’s assessment of that variable
Experts express their uncertainty about each unknown quantity by specifying its 5th, 50th, and 95th percentiles, both for a set of seed variables (whose actual realizations are known to the analyst alone) and for a set of variables of interest
The analyst determines the Intrinsic Range or bounds for the variable distributions
• By specifying the 5%, 50% and 95%-iles, the expert is specifying a 4-bin multinomial distribution with probabilities .05, .45, .45, and .05 for each seed variable response
• Let s_i denote the observed relative frequency of seed-variable realizations falling in bin i, and p_i the corresponding theoretical probability
• We may test how well the expert is calibrated by testing the hypothesis H0: s_i = p_i for all i versus Ha: s_i ≠ p_i for some i
Test Statistic
If N (the number of seed variables) is large enough
Thus the calibration score for the expert is the probability of obtaining a relative information score at least as bad as (greater than or equal to) the one actually observed
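The test statistic itself did not survive in this transcript. A minimal sketch of the calibration calculation follows, assuming the form commonly given for the Classical Model: the relative information I(s, p) = Σ s_i ln(s_i / p_i), with 2·N·I(s, p) treated as approximately chi-square distributed with 3 degrees of freedom when N is large. All function names are illustrative.

```python
import math
from bisect import bisect_left

# Theoretical bin probabilities implied by the 5%, 50%, 95% quantiles.
BIN_PROBS = (0.05, 0.45, 0.45, 0.05)


def bin_counts(quantiles, realizations):
    """Count how many realizations fall in each of the four inter-quantile bins.

    quantiles: one (q5, q50, q95) tuple per seed variable.
    realizations: the true value of each seed variable.
    """
    counts = [0, 0, 0, 0]
    for q, x in zip(quantiles, realizations):
        counts[bisect_left(q, x)] += 1
    return counts


def relative_information(s, p=BIN_PROBS):
    """I(s, p) = sum_i s_i * ln(s_i / p_i), with 0 * ln(0) taken as 0."""
    return sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)


def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def calibration_score(counts):
    """P(chi2_3 >= 2 * N * I(s, p)) for the observed bin counts."""
    n = sum(counts)
    s = [c / n for c in counts]
    return chi2_sf_3df(2 * n * relative_information(s))
```

An expert whose realizations fall in the bins in exactly the proportions 5/45/45/5 receives a score of 1; an expert whose realizations mostly fall outside the 5%-95% band receives a score near 0.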
The relative information for expert e on a variable is
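The formula on this slide was lost in extraction. As a hedged sketch, assuming a uniform background measure on the intrinsic range [L, U] (a log-uniform background is the other usual choice), the relative information of an expert's three quantiles for one variable is often written as ln(U − L) + Σ p_i ln(p_i / (q_i − q_{i−1})), where q_0 = L and q_4 = U. The function name and arguments below are illustrative.

```python
import math

# Probability mass the expert places in each inter-quantile bin.
BIN_PROBS = (0.05, 0.45, 0.45, 0.05)


def information_score(q5, q50, q95, lower, upper):
    """Relative information of one expert assessment w.r.t. a uniform background.

    lower, upper: the intrinsic range chosen by the analyst (assumed inputs).
    Narrower bins (a more concentrated distribution) give a larger score.
    """
    edges = (lower, q5, q50, q95, upper)
    widths = [b - a for a, b in zip(edges, edges[1:])]
    if any(w <= 0 for w in widths):
        raise ValueError("quantiles must be strictly increasing inside the range")
    return math.log(upper - lower) + sum(
        p * math.log(p / w) for p, w in zip(BIN_PROBS, widths)
    )
```

Under these assumptions, an expert whose quantiles reproduce the uniform background scores 0, and the score grows as the quantiles are drawn closer together.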
• The total weight for the expert is the normalized product of the calibration score and the information score
• The weights are optimized by choosing a cutoff value α such that an expert with C(e) < α receives zero weight
• α is selected so that a fictitious expert with a distribution equal to that of the weighted combination of expert distributions would be given the highest weight among experts
• Final uncertainty distribution: F_DM(x) = Σ_i w_i F_i(x), where w_i is the weight of expert i and F_i(x) is that expert's distribution
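The weighting and combination steps can be sketched as follows. This is an illustration under stated assumptions: the cutoff zeroes the weight of any expert whose calibration score falls below α, the expert distributions are supplied as CDF callables, and the function names are hypothetical. The search for the optimal α is not shown.

```python
def performance_weights(calibration, information, alpha=0.0):
    """Normalized calibration x information weights with a calibration cutoff."""
    raw = [c * i if c >= alpha else 0.0 for c, i in zip(calibration, information)]
    total = sum(raw)
    if total == 0:
        raise ValueError("every expert was excluded by the cutoff")
    return [r / total for r in raw]


def pooled_cdf(weights, cdfs, x):
    """Decision maker's CDF at x: the weighted sum of the experts' CDFs."""
    return sum(w * cdf(x) for w, cdf in zip(weights, cdfs))


# Three hypothetical experts; the second falls below the cutoff and drops out.
weights = performance_weights(
    calibration=[0.5, 0.01, 0.3],
    information=[1.0, 2.0, 1.5],
    alpha=0.05,
)
```

Setting every calibration and information score to the same value reproduces the equally weighted decision maker, which is the baseline the later slides compare against.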
Three reasons for an iterative cross-validation analysis
• The Classical Model uses a set of seed variables to develop expert weights; an iterative approach is needed
• The question of the minimum number of seed variables required has not been answered
• The ongoing debate over the robustness of the Classical Model (performance weights versus equal weights)
Cooke and Goossens (2008)
• Examines 45 expert judgment studies compiled over 20 years
Clemen (2008)
• Asserts that “in-sample” analysis is biased toward the Classical Model; suggests the use of “out-of-sample”/Remove-One-At-a-Time (ROAT) analysis
• Selected 14 studies to compare the performance-weighted (PW) decision maker and the equally-weighted (EW) decision maker
Cooke (2008)
• Notes that a ROAT approach tends to favor or punish excluded experts and presents a “two-fold” cross validation
• In 20 of 26 validation runs, the PW outperformed the EW
Lin and Cheng (2008; 2009)
• Using out-of-sample analysis, examines the available 45 studies and finds that the PW outperforms the EW, but with degraded performance
Flandoli et al. (2010)
• Performs a modified “two-fold” cross validation with 500 combinations of 30-70 splits
• Results show that Cooke’s model gives the best indication of uncertainty when averaged
Analysis conducted
• Comprehensive “Out-of-Sample” analysis
• One-tailed sign test (Clemen, 2008)
Data used
• 55 expert judgment studies compiled over 20 years
• 63 data sets: 604 experts, 770 seed variables, ~68M judgments
Iteration   Seed Variables Used   Target Variables Evaluated
1           1                     2, 3, 4
2           2                     1, 3, 4
3           3                     1, 2, 4
4           4                     1, 2, 3
5           1, 2                  3, 4
6           1, 3                  2, 4
7           1, 4                  2, 3
8           2, 3                  1, 4
9           2, 4                  1, 3
10          3, 4                  1, 2
11          1, 2, 3               4
12          1, 2, 4               3
13          1, 3, 4               2
14          2, 3, 4               1
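The iteration scheme in the table, shown for a study with four calibration variables, can be generated programmatically. The sketch below enumerates every non-empty proper subset of the variables as the seed set and treats the remainder as targets; the function name is illustrative.

```python
from itertools import combinations


def seed_target_splits(n_variables):
    """Enumerate every (seed set, target set) split of variables 1..n.

    Splits are ordered by seed-set size and then lexicographically, so that
    every split has at least one seed and at least one target variable.
    """
    variables = tuple(range(1, n_variables + 1))
    splits = []
    for k in range(1, n_variables):
        for seeds in combinations(variables, k):
            targets = tuple(v for v in variables if v not in seeds)
            splits.append((seeds, targets))
    return splits


for iteration, (seeds, targets) in enumerate(seed_target_splits(4), start=1):
    print(iteration, seeds, targets)
```

For n variables this yields 2^n − 2 splits (14 for n = 4), which grows quickly and helps explain the large judgment count quoted for the full analysis.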
Extent of previous cross-validation studies
Mean Out-of-Sample Combination Scores (Calibration × Information)
Columns 1-8 give the number of variables used to determine the performance measure.

Study ID   Experts  Variables  DM Type  1       2       3       4       5       6       7       8
MVOSEEDS   77       5          PWDM     0.3259  0.5579  0.6773  0.8414
                               EWDM     0.0279  0.1154  0.3071  0.6963
A_SEED     7        6          PWDM     0.1434  0.3312  0.3462  0.3332  0.4439
                               EWDM     0.0072  0.0229  0.0580  0.1260  0.2508
AOTDAILY   7        6          PWDM     0.0167  0.0294  0.0583  0.1199  0.2271
                               EWDM     0.0164  0.0313  0.0586  0.1036  0.1565
FCEP       5        8          PWDM     0.0028  0.5309  0.7328  0.8917  1.0556  1.0792  1.1396
                               EWDM     0.0001  0.0008  0.0038  0.0135  0.0399  0.1059  0.2434
BSWAAL     6        8          PWDM     0.3811  0.2538  0.3624  0.3958  0.3932  0.3665  0.4860
                               EWDM     0.2697  0.3142  0.3458  0.3688  0.3862  0.3900  0.4406
DSM-1      10       8          PWDM     0.1546  0.2075  0.2448  0.3224  0.4849  0.6048  0.6591
                               EWDM     0.2637  0.2939  0.3105  0.3241  0.3403  0.3576  0.4508
MONT1      11       8          PWDM     0.6249  0.6168  0.5673  0.5964  0.6656  0.6350  0.6423
                               EWDM     0.2312  0.2880  0.3497  0.4158  0.4854  0.5734  0.7321
SO3EXPTS   4        9          PWDM     0.0123  0.1847  0.3236  0.5801  0.7460  0.9834  1.0993  2.1950
                               EWDM     2.9E-5  0.0002  0.0013  0.0063  0.0254  0.0856  0.2407  0.5700
WATERPOL   11       9          PWDM     0.0115  0.1661  0.4032  0.5544  0.6987  0.8737  0.9985  1.0289
                               EWDM     0.0033  0.0111  0.0313  0.0687  0.1195  0.1798  0.2624  0.4852
Single Decision Maker Dominates in 28 of 63 Cases
• PWDM: 21 Cases
• EWDM: 7 Cases
Single Modal Switching in 22 of 63 Cases
• EWDM gives way to PWDM: 10 Cases
• PWDM gives way to EWDM: 12 Cases
Dual Modal Switching (Parabolic) in 11 of 63 Cases
• PWDM at the extremes: 7 Cases
• EWDM at the extremes: 4 Cases
Somewhat Random Switching in 2 of 63 Cases
• BSWAAL
• ACNEXPTS
Accuracy Measures (One-tailed Sign Test)
Median p-value by data set
• PWDM is more accurate than EWDM (p ≥ 0.5): 42 of 63 cases
• EWDM is more accurate than PWDM (p < 0.5): 11 of 63 cases
• Overall median p-value: 0.74
Median p-value by number of seed variables
• PWDM is more accurate in ALL cases
• PWDM is significantly more accurate in all but two cases
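For readers unfamiliar with the test, a one-tailed sign test on paired scores can be sketched as below. This is a generic textbook version, not the exact procedure used in the study: ties are dropped, the p-value is the binomial tail probability of seeing at least the observed number of PW wins under a fair coin, and the slides appear to report p-values under a different convention (one in which larger values favor the PWDM). The function name and the scores are hypothetical.

```python
from math import comb


def sign_test_p_value(pw_scores, ew_scores):
    """One-tailed sign test: P(at least the observed PW wins | wins ~ Bin(n, 0.5))."""
    wins = sum(p > e for p, e in zip(pw_scores, ew_scores))
    losses = sum(p < e for p, e in zip(pw_scores, ew_scores))
    n = wins + losses  # tied pairs carry no sign and are dropped
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical paired scores: PW beats EW on 6 of 7 target variables.
p_value = sign_test_p_value(
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.1],
    [0.5, 0.5, 0.5, 0.5, 0.4, 0.3, 0.2],
)
```

In this version a small p-value is evidence that the PW decision maker wins more often than chance would explain.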
Data Used
• The data set for this research comes from the unpublished white paper by Coleman, Kulick, and Pisano (1996) on the T45TS Cockpit-21 project
• Actual data used simulated for use in the expert judgment model
Appendix B
Figure 4-3, p. 112: E-TRI Flow Diagram
Oct-93 Milestone Review: Nov-93 TPM value is predicted
Nov-93 Milestone Review: TPM value is realized
Expert weights are determined
Decision maker’s assessment is calculated • Using weighted expert predictions • Calculated for all remaining milestones
Nov-93 updated prediction is presented
E-TRI for final state (Feb-94) is calculated
Case Study Data
System E-TRI for final state (Feb-94) is calculated
QUESTIONS?
Garvey, P. R., & Cho, C.-C. (2003). An Index to Measure a System's Performance Risk. Acquisition Review Quarterly, Spring, 189-199.
Winkler, R. L. (1968). The consensus of subjective probability distributions. Management Science, 15(2), 61-75.
Cooke, R. M. (1991). Experts in uncertainty: opinion and subjective probability in science. New York: Oxford University Press.
Clemen, R. T. (2008). Comment on Cooke's classical method. Reliability Engineering & System Safety, 93(5), 760-765.
Coleman, C., Kulick, K., & Pisano, N. (1996). Technical performance measurement (TPM) retrospective implementation and concept validation on the T45TS Cockpit-21 program. Program Executive Office for Air Anti-Submarine Warfare, Assault, and Special Mission Programs, White Paper.