
TRANSCRIPT

Page 1:

Probing with Severity: Beyond Bayesian Probabilism and Frequentist Performance

Deborah G. Mayo
December 3, 2014
Rutgers, Department of Statistics and Biostatistics

Page 2:

1. What is the Philosophy of Statistics?

A few years ago, I had a conversation with Sir David Cox:

COX: Deborah, in some fields foundations do not seem very important, but we both think foundations of statistical inference are important; why do you think that is?

MAYO: I think because they ask about fundamental questions of evidence, inference, and probability. ...In statistics we're so intimately connected to the scientific interest in learning about the world, we invariably cross into philosophical questions about empirical knowledge and inductive inference.

Page 3:

While statisticians and philosophers of science go about their work in very different ways, at one level, they ask many similar questions:  

• What should be observed, and what inferences are (or are not) warranted from the data?

• How well do data confirm or fit a model?

• What is a good test? Does failure to reject hypothesis H yield evidence "confirming" H?

• How can we tell if an anomaly is genuine?

Page 4:

• Is it relevant to inference if:

§ data have been used to construct or select the hypothesis for testing? (novelty, double-counting)

§ data collection stops just when a hypothesis looks good? (optional stopping)

No wonder statistics tends to readily cross over into philosophical territory.        

Page 5:

Central job of philosophers of science

• to help resolve the conceptual, logical, and methodological discomforts of scientists

• especially in fields that deal with issues of scientific knowledge, evidence, inference, and learning despite uncertainties and errors

The risk of error enters because we want to move beyond the evidence or data. We may call such evidence-transcending inferences inductive inferences.

Page 6:

Induction as evidence-transcending inference

The conclusion is "evidence transcending": the premises (i.e., data) can be true while the inferred conclusion is false, without logical contradiction. This frees us to talk about "induction" without presupposing certain forms of ampliative inference

…in particular, without presupposing there's just one role for probability (however it's interpreted). That will be one of my topics.

Page 7:

2. Is Philosophy of Statistics Relevant for Statistical Science Practice?

I would never be so bold as to suggest that statisticians are hampered by philosophical issues swirling around their methods. Only in certain moments is there a need for a philosophical or self-reflective standpoint in practice.

I do not want to rehash the "statistics wars" of the '60s, '70s, '80s, '90s, up to the present, even though the significance test controversy is still hotly debated, with task forces set up to stem automatic, recipe-like uses of statistics that have long been deplored.

Page 8:

• Increasingly, over the past decade or so, some issues seem to cry out for philosophical ministrations.

It's in these recent issues that I'm mainly interested (e.g., the reproducibility crisis, questionable research practices (QRPs), Bayes-Frequentist unifications, big data).

Page 9:

In one sense there's a happy eclecticism in today's practice. Many statisticians suggest that throwing different and competing methods at a problem is all to the good: it increases the chance at least one will be right. This may be so, but one needs to understand how to interpret and relate the competing answers...which goes back to philosophical underpinnings...

Page 10:

Others think that foundational conflicts are bad for the profession and seek some kind of [Bayesian-Frequentist] "unification" or "reconciliation."

Donald Fraser: "How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantively different answers to the same problems? ...Is complacency in the face of contradiction acceptable for a central discipline of science?" (Fraser 2011, p. 329)

Page 11:

Jim Berger: "We [statisticians] are not blameless....we have not made a concerted professional effort to provide the scientific world with unified testing methodology...and so are tacit accomplices in the unfortunate statistical situation." (J. Berger 2003)

Not waiting for philosophers: "...professional agreement on statistical philosophy is not on the immediate horizon, but this should not stop us from agreeing on methodology." But what is correct methodologically depends on what is correct philosophically.

Page 12:

• Finding ways to equate posterior probabilities with frequentist error probabilities (however clever) masks underlying conflicts: we get agreements on numbers that fail both as degrees of belief and as relevant error probabilities….

• But subjective Bayesians don't seem to like the unifications much either. Dennis Lindley (1997, commenting on Bernardo): they focus too much on technique at the expense of the "Bayesian standpoint" (i.e., updating degrees of belief).

But there is growing confusion as to what the Bayesian or frequentist standpoints are …

Page 13:

• Jay Kadane, venerable subjective Bayesian: "The growth in use and popularity of Bayesian methods has stunned many of us who were involved in exploring their implications decades ago. The result...is that there are users of these methods who do not understand the philosophical basis of the methods they are using, and hence may misinterpret or badly use the results.... No doubt helping people to use Bayesian methods more appropriately is an important task of our time."

But many question whether the classic subjective construal is the only way to be helped…

Page 14:

In an attempted meeting of the minds (a Bayesian and an error statistician), Andrew Gelman and Cosma Shalizi (2013) suggest:

"The main point where we disagree with many Bayesians is that we do not think that Bayesian methods are useful for giving the posterior probability that a hypothesis is true.... for evaluating a model, we prefer...what Cox and Hinkley call 'pure significance testing'" (Gelman & Shalizi 2013, p. 2).

"Implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo (1996), despite the latter's frequentist orientation." (p. 10)

Page 15:

The philosophical doctor is in (to whittle it down): the current situation is a result of never having been clear on contrasting views on:

(a) the roles of probability in ampliative inference, and
(b) the nature and goals of inductive/statistical inference in relation to scientific inquiry

• What is correct methodologically turns on philosophy
• Methodology without philosophy is shallow (as is philosophy of statistics without statistical methodology)

Page 16:

3. Popper: Frequentists as Carnap: Bayesians

To begin, we might probe this philosophy-statistics analogy:

• I said philosophers of statistics address some similar problems as statisticians but in different ways

• Philosophers have often looked for a logic to relate data x and hypotheses H. Confirmation relation: C(H, x) (Carnap)

Karl Popper focuses on logics of “falsification” and “corroboration”

Page 17:

Popper: "In opposition to [the] inductivist attitude, I assert that C(H,x) must not be interpreted as the degree of corroboration of H by x, unless x reports the results of our sincere efforts to overthrow H. The requirement of sincerity cannot be formalized—no more than the inductivist requirement that e must represent our total observational knowledge." (Popper 1959, p. 418)

"Observations or experiments can be accepted as supporting a theory (or a hypothesis, or a scientific assertion) only if these observations or experiments are severe tests of the theory—or in other words, only if they result from serious attempts to refute the theory." (Popper 1994, p. 89)

Page 18:

4. Severity requirement

When we reason this way, we are insisting on

Weakest Requirement for a Genuine (Severe) Test: Agreement between data x and H fails to count in support of a hypothesis or claim H, if so good an agreement was (virtually) assured even if H is false—no test at all! (Bad evidence, no test: BENT)

Yet Popper's own computations never gave him a way to characterize severity adequately.

Page 19:

Aside: Popper wrote to me, "I regret never having learned statistics."

• I argue that the central role of probability in statistical inference is severity—its assessment and control.

• Existing error probabilities (confidence levels, significance levels) may but need not provide severity assessments.

Data x (from test T) are evidence for H only if H has passed a severe test with x (one with a reasonable capability of having detected flaws in H). So we need to assess this “capability” in some way.

Page 20:

“A severity account based on error probabilities” (error statistics)

New name: the differences in justification and interpretation call for one. Existing labels—frequentist, sampling theory, Fisherian, Neyman-Pearsonian—are too associated with hard-line views.

Page 21:

5. Two main roles of probability in statistical inference

Probabilism. To provide a post-data assignment of degree of probability, confirmation, support, or belief in a hypothesis, absolute or comparative, given data x0 (Bayesian posterior, Bayes ratio, Bayes boosts). I would include likelihoodism.

Performance. To ensure long-run reliability of methods: coverage probabilities, controlling the relative frequency of erroneous inferences in a long-run series of trials.

Page 22:

What happened to the goal of the severity criterion? Neither "probabilism" nor "performance" directly captures it. Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.

The problems with QRPs (questionable research practices), such as selective reporting, multiple testing, and stopping when the data look good, are not problems about long runs. It's that we cannot say the case at hand has done a good job of avoiding the sources of misinterpretation.

Page 23:

Compare:

• Probabilism says H is not justified unless it's true or probable (or increases probability, makes firmer).

• Performance says H is not justified unless it stems from a method with low long-run error.

• Probativism says H is not justified unless something has been done to probe ways we can be wrong about H.

My work is extending and reinterpreting frequentist error statistical methods to reflect the severity rationale.

 

Page 24:

For solving problems:

• in philosophy of science about evidence (Error and the Growth of Experimental Knowledge, 1996)

• in  philosophy  of  statistics  

Note: The severity construal blends testing and estimation, but I keep to testing talk to underscore the probative demand.

I admit that I am supplying a philosophy—one that makes sense of their use and scotches well-known criticisms and misinterpretations. At most I found hints and examples in E.S. Pearson.

Page 25:

Neyman was generally the behavioristic-performance one. A few years ago, I found an obscure article where Neyman responds to philosopher Carnap's criticism of "Neyman's frequentism."

Neyman (criticizing Carnap): "I am concerned with the term 'degree of confirmation' introduced by Carnap. ...We have seen that the application of the locally best one-sided test to the data...failed to reject the hypothesis [that the 26 observations come from a source in which the null hypothesis is true]. The question is: does this result 'confirm' the hypothesis that H0 is true of the particular data set?" (Neyman 1955, p. 41)

Page 26:

Locally best one-sided test T:

A sample X = (X1, ..., Xn), each Xi Normal, N(µ, σ²) (NIID), σ assumed known; M is the sample mean.

H0: µ ≤ µ0 against H1: µ > µ0

Test statistic: d(X) = (M − µ0)/σx, where σx = σ/√n

The test fails to reject the null: d(x0) ≤ cα. "The question is: does this result 'confirm' the hypothesis that H0 is true of the particular data set?" (Neyman)

Carnap says yes…

Page 27:

Neyman: "...the attitude described is dangerous....the chance of detecting the presence [of discrepancy δ from the null], when only [this number] of observations are available, is extremely slim, even if [δ is present]. One may be confident in the absence of that discrepancy only if the power to detect it were high." (power analysis)

If Pr(d(X) > cα; µ = µ0 + δ) is high:

from d(x0) < cα, infer: discrepancy < δ
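To make the power computation concrete, here is a minimal sketch in Python; the values of µ0, σ, n, α, and δ are illustrative assumptions, not Neyman's:

```python
# A sketch of power analysis for the one-sided Normal test
# H0: mu <= mu0 vs. H1: mu > mu0, sigma known. All numbers are
# illustrative assumptions, not Neyman's 1955 values.
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 26, 0.05
sigma_x = sigma / sqrt(n)        # standard error of the sample mean M
c_alpha = norm.ppf(1 - alpha)    # cutoff for d(X) = (M - mu0)/sigma_x

def power(delta):
    """Pr(d(X) > c_alpha; mu = mu0 + delta): the chance the test
    detects a discrepancy delta from the null."""
    # Under mu = mu0 + delta, d(X) ~ N(delta/sigma_x, 1).
    return norm.sf(c_alpha - delta / sigma_x)

for delta in (0.1, 0.3, 0.5):
    print(f"power to detect delta = {delta}: {power(delta):.3f}")
```

On Neyman's reasoning, a non-significant result licenses "discrepancy < δ" only for those δ at which this power is high.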

Page 28:

6. Severity and Fallacies of Non-Statistically Significant Results  

Neyman's criticism of Carnap deals with a classic fallacy of non-significant results: to construe such a "negative" result as evidence FOR the correctness of the null hypothesis.

"No evidence against" is not "evidence for."

Merely surviving the statistical test is too easy; it occurs too frequently, even when the null is false.

   

Page 29:

(1) P(d(X) > cα; µ = µ0 + δ): power to detect δ

• Neyman requires (1) to be high (for non-significance to warrant µ < µ0 + δ)

• Just missing the cut-off cα is the worst case

• It is more informative to look at (2):

(2) P(d(X) > d(x0); µ = µ0 + δ): "attained power"

• (1) can be low while (2) is high

• (2) provides a measure of the severity (or degree of corroboration) for the inference µ < µ0 + δ
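A minimal sketch of the contrast, with illustrative numbers of my own: when the observed d(x0) falls well below the cutoff cα, (2) can be high even though (1) is modest:

```python
# Contrast ordinary power (1) with "attained power" (2) for the
# one-sided Normal test. All numbers are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
sigma_x = sigma / sqrt(n)
c_alpha = norm.ppf(1 - alpha)    # ~1.645
d_obs = 0.2                      # observed d(x0), well below c_alpha
delta = 0.3                      # discrepancy of interest

pow_1 = norm.sf(c_alpha - delta / sigma_x)  # (1) Pr(d(X) > c_alpha; mu0 + delta)
pow_2 = norm.sf(d_obs - delta / sigma_x)    # (2) Pr(d(X) > d(x0); mu0 + delta)

print(f"(1) power:          {pow_1:.2f}")   # ~0.44 with these numbers
print(f"(2) attained power: {pow_2:.2f}")   # ~0.90: severity for mu < mu0 + delta
```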

Page 30:

Not the same as something called "retrospective power" or "ad hoc" power! (There µ is identified with the observed mean.)

In Mayo and Cox 2006, it's in terms of the P-value ("Frequentist principle of evidence"):

FEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist. (That is, only if (2) is high.)

Page 31:

This is equivalently captured in the Severity Rule (Mayo 1996, Mayo and Spanos 2006):

Test T: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then µ < M0 + kεσ/√n passes test T with severity (1 − ε), where P(d(X) > kε) = ε.

The connection with the upper confidence limit is obvious. Infer: µ < CIu.

   

Page 32:

If one wants to emphasize the post-data measure, one can write:

SEV(µ < M0 + δσx), to abbreviate:

the severity with which (µ < M0 + δσx) passes test T.

It is computed as Pr(d(X) > d(x0); µ = µ0 + δ). Severity has three terms: SEV(test, outcome, inference).

Page 33:

One can consider a series of upper discrepancy bounds…

SEV(µ < M0 + 0σx) = .5

SEV(µ < M0 + .5σx) = .7

SEV(µ < M0 + 1σx) = .84

SEV(µ < M0 + 1.5σx) = .93

SEV(µ < M0 + 1.96σx) = .975

[How this relates to work by Min-ge Xie and others on confidence distributions is something I hope to learn more about.]

But aren't I just using this as another way to say how probable each claim is?
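For the Normal test with known σ, these values are just standard Normal tail areas: SEV(µ < M0 + δσx) = Pr(M > M0; µ = M0 + δσx) = Φ(δ). A minimal sketch reproducing the series:

```python
# Reproducing the series of severity assessments above. For the Normal
# test with sigma known, SEV(mu < M0 + delta*sigma_x) reduces to
# Phi(delta), the standard Normal CDF evaluated at delta.
from scipy.stats import norm

for delta in (0, 0.5, 1, 1.5, 1.96):
    print(f"SEV(mu < M0 + {delta}*sigma_x) = {norm.cdf(delta):.3f}")
# Prints .500, .691, .841, .933, .975, matching the series to rounding.
```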

Page 34:

No. This would lead to inconsistencies (if we mean mathematical probability), but the main thing, or so I argue, is that probability gives the wrong logic for "how well-tested" (or "corroborated") a claim is. (There may be a confusion with the ordinary-language use of "probability": belief is very different from well-testedness.)

Note: low severity is not just a little bit of evidence, but bad evidence, no test (BENT).

Page 35:

7. Severity vs. Rubbing-Off

The severity construal is different from what I call the rubbing-off construal: the procedure is rarely wrong; therefore, the probability it is wrong in this case is low.

This is still too much of a performance criterion, too behavioristic.

The long-run reliability of the rule is a necessary but not a sufficient condition to infer H (with severity)

Page 36:

The reasoning instead is counterfactual:

H: µ < M0 + 1.96σx

(i.e., µ < CIu )

H passes severely because, were this inference false and the true mean µ > CIu, then, very probably, we would have observed a larger sample mean.

Page 37:

8. Error (Probability) Statistics

What is key on the statistics side: the probabilities refer to the distribution of the statistic d(X) (the sampling distribution).

What is key on the philosophical side: error probabilities may* be used to quantify probativeness or severity of tests (for a given inference) *they do not always or automatically give this

Under this umbrella I include the use of error probabilities for performance goals, but currently that seems to be the only way it’s used

Cox has long spoken of using P-values to measure "consistency" with a null: in effect I'm giving the justification for that and other "evidential construals."

Page 38:

The analogous reasoning is used to avoid fallacies of significant results or fallacies of rejection:

Two forms:

• Inferring a substantive claim unwarranted by the statistical inference

• Inferring a discrepancy from the null beyond what the test warrants

I won't go through it, except to note: severity goes in the opposite direction of power when it comes to inferring a discrepancy from the null with a statistically significant result:

Page 39:

Example: an α-significant difference is indicative of less of a discrepancy from the null with large n than if it resulted from a smaller sample size.

Instead of going through the worry in a general way, I'll illustrate and at the same time address the question: how is this relevant to current controversies about P-values and such?

Page 40:

9. Higgs discovery: "5 sigma observed effect"

So worried about overly sensitive tests, researchers refused to announce evidence for the discovery of a Higgs particle on July 4, 2012 until they reached a "5 sigma observed effect".

I recently spoke about statistics in the Higgs discovery at the Philosophy of Science Association

• Because the 5 sigma report refers to frequentist statistical tests, the discovery was immediately imbued with controversies from philosophy of statistics

Page 41:

Bad Science? (O'Hagan, prompted by Lindley)

To the ISBA: "Dear Bayesians: We've heard a lot about the Higgs boson....Specifically, the news referred to a confidence interval with 5-sigma limits.... Five standard deviations, assuming normality, means a p-value of around 0.0000005.... Why such an extreme evidence requirement? We know from a Bayesian perspective that this only makes sense if (a) the existence of the Higgs boson has extremely small prior probability and/or (b) the consequences of erroneously announcing its discovery are dire in the extreme....Are the particle physics community completely wedded to frequentist analysis? If so, has anyone tried to explain what bad science that is?"

Page 42:

Not bad science at all!

• HEP physicists are sophisticated with their statistical methodology: they’d seen too many bumps disappear.

• They want to ensure, before announcing the hypothesis H*: "a new particle has been discovered," that:

H* has been given a severe run for its money.

Significance tests and cognate methods (confidence intervals) are the methods of choice here for good reason.

Page 43:

   

Statistical significance test in the Higgs:

(i) Null or test hypothesis, in terms of a model of the detector:

μ is the "global signal strength" parameter

H0: μ = 0, i.e., zero signal (background-only hypothesis)

H0: μ = 0 vs. H1: μ > 0

μ = 1: Standard Model (SM) Higgs boson signal in addition to the background

Page 44:

(ii) Test statistic or distance statistic d(X): how many excess events of a given type are observed (from trillions of collisions) in comparison to what would be expected from background alone (in the form of bumps).

(iii) The P-value (or significance level) associated with d(x0) is the probability of a difference as large as or larger than d(x0), under H0:

P-value = Pr(d(X) > d(x0); H0)

Usually the P-value is regarded as sufficiently small at around .05, .01, or .001.

Page 45:

Pr(d(X) > 5; H0) = .0000003

The probability of observing results as extreme as or more extreme than 5 sigma, under H0, is approximately 1 in 3,500,000.

 

The actual computations are based on simulating what it would be like were H0: µ = 0 (signal strength = 0) true, fortified with much cross-checking of results.
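As a back-of-the-envelope check (a sketch only: as just noted, the actual analyses simulate the background model rather than reading off a Normal tail), the 5 sigma figure corresponds to the one-sided upper tail of a standard Normal:

```python
# Back-of-the-envelope check of the 5 sigma P-value as a one-sided
# standard Normal tail area (the actual analyses simulate H0 instead).
from scipy.stats import norm

p = norm.sf(5)                      # Pr(d(X) >= 5; H0)
print(f"P-value ~ {p:.1e}")         # ~2.9e-07
print(f"about 1 in {1 / p:,.0f}")   # ~1 in 3,500,000
```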

Page 46:

 The P-Value Police

When the July 2012 report came out, a number of people set out to grade the different interpretations of the P-value report:

Larry Wasserman (“Normal Deviate” on his blog) called them the “P-Value Police”.

• Job: to examine if reports by journalists and scientists could by any stretch of the imagination be seen to have misinterpreted the sigma levels as posterior probability assignments to the various models and claims.

Page 47:

Thumbs up or down?

Thumbs up to the ATLAS group report:

“A statistical combination of these channels and others puts the significance of the signal at 5 sigma, meaning that only one experiment in three million would see an apparent signal this strong in a universe without a Higgs.”

Thumbs down to reports such as:

“There is less than a one in 3.5 million chance that their results are a statistical fluke.”

Page 48:

Critics (Spiegelhalter) allege they are misinterpreting the P-value as a posterior probability on H0.

Not so.

H0 does not say the observed results are due to background alone, or are flukes:

H0: µ = 0

although, if H0 were true, it follows that various results would occur with specified probabilities. (In particular, it entails that large bumps are improbable.)

In fact, it is an ordinary error probability.

Page 49:

(1) Pr(Test T produces d(X) > 5; H0) ≤ .0000003

True, the inference actually detached goes beyond a P-value report.

(2) There is strong evidence for

H*: a Higgs (or a Higgs-like) particle.

Inferring (2) relies on an implicit principle of evidence.

Page 50:

SEV Principle for statistical significance: If Pr(Test T produces d(X) < d(x0); H0) is high, then μ > μ0 passes the test with high severity...

(1)’ Pr(Test T produces d(X) < 5; H0) > .9999997

• With probability .9999997, the bumps would be smaller, would behave like flukes, disappear with more data, not be produced at both CMS and ATLAS, in a world given by H0.

• They didn't disappear; they grew.

(2) So, H*: a Higgs (or a Higgs-like) particle.

Page 51:

Goes beyond long-run performance: if you interpret 5 sigma bumps as a real effect (a discrepancy from 0), you'd erroneously interpret data with probability less than .0000003.

An error probability

The warrant isn’t low long-run error (in a case like this) but detaching an inference based on a severity argument.

Qualifying claims by how well they have been probed (precision, accuracy).

Page 52:

Those who think we want a posterior probability in H* might be sliding from what may be inferred from this legitimate high probability:

Pr(test T would not reach 5 sigma; H0) > .9999997

With probability .9999997, our methods would show that the bumps disappear, under the assumption the data are due to background (H0).

Most HEP physicists believe in Beyond Standard Model (BSM) physics, but to their dismay, they find themselves unable to reject the SM null (the bumps keep disappearing).

Page 53:

Look Elsewhere Effect (LEE)

Lindley/O'Hagan: "Why such an extreme evidence requirement?"

Their report is of a nominal (or local) P-value: the P-value at a particular, data-determined mass.

§ The probability of so impressive a difference anywhere in a mass range would be greater than the local one.

§ Requiring a P-value of at least 5 sigma is akin to adjusting for multiple trials, or the look elsewhere effect (LEE).

This leads to THE key issue of controversy in the philosophy of statistics: whether to take account of selection effects

Page 54:

10. Likelihood Principle (LP)

Taking into account the sampling distribution once the data are observed violates the (strong) likelihood principle LP.

Savage: "...if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ..." (Savage 1962, p. 17)

Page 55:

Optional Stopping Effect

To illustrate the LP violation, consider the "optional stopping effect." We have a random sample from a Normal distribution with mean µ and standard deviation σ, i.e.,

Xi ~ N(µ, σ²), and we test H0: µ = 0 vs. H1: µ ≠ 0.

Stopping rule: keep sampling until H0 is rejected at the .05 level (i.e., keep sampling until |M| ≥ 1.96σ/√n).

Page 56:

"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30, and so on, until finally a 1.96 SD difference was obtained.

With this stopping rule the actual significance level differs from, and will be greater than, the .05 that would hold for fixed n (nominal vs. actual significance levels).
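A minimal simulation of the effect (n_max and the number of repetitions are arbitrary choices of mine): generate data under H0 and check how often the "1.96 SD" rule is eventually triggered:

```python
# Simulating the optional stopping effect: sample under H0 (mu = 0) and
# "reject" as soon as |M| >= 1.96*sigma/sqrt(n), up to n_max observations.
# The nominal level at each look is .05; the actual rejection rate under
# H0 comes out far higher. n_max and reps are illustrative choices.
import numpy as np

rng = np.random.default_rng(seed=1)
sigma, n_max, reps = 1.0, 1000, 2000
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n_max)   # data generated under H0
    n = np.arange(1, n_max + 1)
    m = np.cumsum(x) / n                # running sample mean M
    if np.any(np.abs(m) >= 1.96 * sigma / np.sqrt(n)):
        rejections += 1                 # stopped with a "significant" result
print(f"actual rejection rate under H0: {rejections / reps:.2f} (nominal .05)")
```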

Page 57:

Jimmy Savage famously assured statisticians that "optional stopping is no sin"; the problem must be significance levels.

"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" (in the sense of Neyman and Pearson) (Edwards, Lindman, and Savage 1963, p. 239).

Page 58:

The key difference: likelihoods fix the actual outcome, while error statistics considers outcomes other than the one observed in order to assess the error properties.

LP → irrelevance of, and no control over, error probabilities (Birnbaum).

Aside: There is a famous “radical” argument purporting to show that error statistical principles entail the Likelihood Principle (LP) (Birnbaum, 1962), but the argument is flawed—invalid or unsound (Mayo 2010/2014).

Page 59:

11. Non-subjective (Conventional) Bayesians Abandon the LP

It has been thought that Bayesian foundational superiority over error statisticians stems from Bayesians upholding, while frequentists violate, the likelihood principle (LP).

Frequentists concede that they (the Bayesians) are coherent, we are not... What, then, to say about leading non-subjective (conventional/reference) Bayesians (Bernardo 2005, Berger 2004) admitting that "violation of principles such as the likelihood principle is the price that has to be paid for objectivity" (Berger 2004)?

Page 60:

Although they are using very different conceptions of "objectivity," there seems to be an odd sort of agreement between them and the error statistician.

Do the concessions of conventional Bayesians bring them into the error statistical fold? I think not (though I will leave this as an open question). I base my (tentative) no answer on this:

While they may still have some ability to ensure low error probabilities in the long run (performance, in some sense),

this would not make them severe error probers....

Page 61:

What are its (Bayesian) foundations? “Reference” priors may not be construed as measuring beliefs or even probabilities—they are often improper.

They are mere “reference points” for getting posteriors, but how shall they be interpreted?

If prior probabilities represent background information, why do they differ according to the experimental model?

Page 62:

Concluding Overview

I began with some remarks on current problems in philosophy of statistics:

• Significance test controversies
• Eclecticism, unification attempts
• Unclarity about "philosophical standpoints" (Bayesian or frequentist)

Philosophy-statistics analogy: Popper: Frequentists as Carnap: Bayesians. But frequentism in statistics focuses on performance, and Popper never came up with an account of severity and corroboration.

Page 63:

Turning to statistical foundations: underlying the debates are two assumptions as to what we need and what we get.

What we need. Probabilism: the role of probability in inference is to assign a degree of belief, support, or confirmation (given by mathematical probability).

What we get from error statistical ("frequentist") methods. Performance: a behavioristic construal.

Page 64:

The criticisms of frequentist statistical methods take the form of one or both:

• Error probabilities do not supply posterior probabilities in hypotheses

• Long-run performance alone may not yield good inferences in particular cases

Page 65:

We reject probabilism and performance:

• Probabilism says H is not justified unless it's true or probable (or increases probability, makes firmer).

• Performance says H is not justified unless it stems from a method with low long-run error.

------------------------------------

• Probativism says H is not justified unless something has been done to probe ways we can be wrong about H. (Trying to make good on Popper?)

Page 66:

My work is extending and reinterpreting frequentist error statistical methods to reflect the severity rationale (and other "evidential" interpretations).

• The severity principle directs us to the relevant error probabilities, avoiding the classic counterintuitive examples

• Where differences remain (disagreements on numbers, e.g., P-values and posteriors), we should recognize the difference in the goals promoted

 

Page 67:

Test T+: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then µ < M0 + kεσ/√n passes test T+ with severity (1 − ε).

(FEV/SEV): If d(x) is statistically significant, then µ > M0 − kεσ/√n passes test T+ with severity (1 − ε),

where P(d(X) > kε) = ε.
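A small sketch wrapping both rules in one helper (the function name, interface, and example numbers are my own packaging; only the formulas come from the rules above):

```python
# A sketch of the dual FEV/SEV rules for test T+ (sigma known). Both
# bounds follow from P(d(X) > k_eps) = eps under the standard Normal.
from math import sqrt
from scipy.stats import norm

def sev_bound(m_obs, sigma, n, eps, significant):
    """Bound on mu that passes test T+ with severity 1 - eps, given the
    observed mean m_obs and whether d(x) was statistically significant."""
    sigma_x = sigma / sqrt(n)
    k_eps = norm.ppf(1 - eps)          # P(d(X) > k_eps) = eps
    if significant:
        return f"mu > {m_obs - k_eps * sigma_x:.3f}"   # lower discrepancy bound
    return f"mu < {m_obs + k_eps * sigma_x:.3f}"       # upper discrepancy bound

# e.g., a non-significant result with M0 = 0.1, sigma = 1, n = 100:
print(sev_bound(0.1, 1.0, 100, eps=0.05, significant=False))   # mu < 0.264
```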

Page 68:

FEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

FEV (significant result): d(X) > d(x0) is strong evidence of a discrepancy δ from H0 if and only if there is a high probability the test would have given d(X) < d(x0) were a discrepancy as large as δ absent.

Page 69:

Some relevant Mayo references:

• Mayo, D. G. (2014). "On the Birnbaum Argument for the Strong Likelihood Principle" (with discussion). Statistical Science 29(2): 227-239, 261-266.

• Mayo, D. G. (2013). "Presented Version: On the Birnbaum Argument for the Strong Likelihood Principle." In JSM Proceedings, Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, 440-453.

• Mayo, D. G. (2012). “Statistical Science Meets Philosophy of Science Part 2: Shallow versus Deep Explorations”, Rationality, Markets, and Morals (RMM) 3, Special Topic: Statistical Science and Philosophy of Science, 71–107.

• Mayo, D. G. (2011) “Statistical Science and Philosophy of Science: Where Do/Should They Meet in 2011 (and beyond).” Rationality, Markets and Morals (RMM) 2, Special Topic: Statistical Science and Philosophy of Science, 79–102.

• Mayo, D. G. and Cox, D. R. (2011) “Statistical Scientist Meets a Philosopher of Science: A Conversation with Sir David Cox.” Rationality, Markets and Morals (RMM), 2, Special Topic: Statistical Science and Philosophy of Science, 103-114.

• Mayo, D. G. and Spanos, A. (2011). "Error Statistics." In Philosophy of Statistics (Handbook of the Philosophy of Science, Volume 7), volume eds. Prasanta S. Bandyopadhyay and Malcolm R. Forster (general editors: Dov M. Gabbay, Paul Thagard, and John Woods). Elsevier: 1-46.

• Mayo, D. G. (2010). "An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 305-14.

Page 70:

• Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.

• Cox D. R. and Mayo. D. G. (2010). "Objectivity and Conditionality in Frequentist Inference" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 276-304.

• Mayo, D. G. and Spanos, A. (2010). "Introduction and Background: Part I: Central Goals, Themes, and Questions; Part II The Error-Statistical Philosophy" in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), Cambridge: Cambridge University Press: 1-14, 15-27.

• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference." In Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

• Mayo, D. G. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction," British Journal of Philosophy of Science, 57: 323-357.

• Mayo, D. (2005). "Philosophy of Statistics" in S. Sarkar and J. Pfeifer (eds.) Philosophy of Science: An Encyclopedia, London: Routledge: 802-815.

Page 71:

Mayo, D. and Spanos, A. (2004). "Methodology in Practice: Statistical Misspecification Testing." Philosophy of Science 71: 1007-1025.

General References:

Bayarri, M. J. and J. O. Berger (2004), The Interplay of Bayesian and Frequentist Analysis. Statistical Science 19(1), 58-80.

J. O. Berger (2003), “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” Statistical Science 18(1), 1–12.

Bernardo, J. M. 2005. “Reference Analysis.” In Handbook of Statistics, edited by D. K. Dey and C. R. Rao, 25: Bayesian Thinking, Modeling and Computation, 17–90. Amsterdam: Elsevier.

Edwards, W., Lindman, H. & Savage, L. J. (1963). Bayesian Statistical Inference for Psychological Research. Psych. Rev. 70(3), 193–242.

Fraser, D. A. S. (2011), Is Bayes Posterior just Quick and Dirty Confidence? Rejoinder. Statistical Science 26(3), 329-331.

Gelman, A., & Shalizi, C. (2013). Philosophy and the Practice of Bayesian Statistics and Rejoinder. Brit. J. Math. & Stat. Psych. 66(1), 8–38; 76-80.

Kadane, Joseph B. (2011). Principles of Uncertainty. Chapman and Hall/CRC.

Lindley, D. (1997). "Some Comments on 'Non-informative Priors Do Not Exist'." Journal of Statistical Planning and Inference 65(1), 182–189.

Page 72:

Neyman, J. (1955), “The Problem of Inductive Inference.” Communications on Pure and Applied Mathematics 8(1), 13–46.

Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. In J. Neyman and E.S. Pearson, 1967, Joint Statistical Papers, (99-115). Cambridge: CUP.

Popper, K. R. (1959). The Logic of Scientific Discovery. New York: Basic Books.

Popper, K. R. (1994). Realism and the Aim of Science: From the Postscript to the Logic of Scientific Discovery. Oxford and New York: Routledge.

Savage, L., ed. (1962). The Foundations of Statistical Inference: A Discussion. London: Methuen & Co.

Singh, K., Xie, M. & Strawderman, W. E. (2005). Combining Information from Independent Sources through Confidence Distributions. The Annals of Statistics 33(1), 159-183.

Xie, M. & Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: A review. International Statistical Review 81(1), 3-39.

Higgs Online links: • Atlas report: http://cds.cern.ch/record/1494183/files/ATLAS-CONF-2012-162.pdf

• Atlas Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/AtlasPublic/HiggsPublicResults

• CMS Higgs experiment, public results: https://twiki.cern.ch/twiki/bin/view/CMSPublic/PhysicsResultsHIG

Page 73:

• Cousins, R. (2014). “The Jeffreys-Lindley Paradox and Discovery Criteria in High Energy Physics” http://arxiv.org/abs/1310.3791

• O’Hagan letter:

§ Original letter with responses: http://bayesian.org/forums/news/3648

§ 1st link in a group of discussions of the letter: http://errorstatistics.com/2012/07/11/is-particle-physics-bad-science/

• Overbye, D. (March 15, 2013) “Chasing the Higgs,” New York Times: http://www.nytimes.com/2013/03/05/science/chasing-the-higgs-boson-how-2-teams-of-rivals-at-CERN-searched-for-physics-most-elusive-particle.html?pagewanted=all&_r=0

• Spiegelhalter, D. (August 7, 2012) blog, Understanding Uncertainty , “Explaining 5 sigma for the Higgs: how well did they do?” http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do

• Strassler, M. (July 2, 2013) blog, Of Particular Significance, “A Second Higgs Particle”: http://profmattstrassler.com/2013/07/02/a-second-higgs-particle/

http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/