TASK FORCE SUMMARY
Posted 11-Feb-2016
Method: Design

- Don't 'pretend' the study is something it's not.
- Hypothesis generating vs. hypothesis testing, or exploratory vs. confirmatory: both can be of great value, and they are not mutually exclusive, even within a single study.
- Populations can be anything; make sure it is clear which population you are trying to speak to.

Method: Sampling

- Sampling can actually be a quite complex undertaking; make sure it is clear how the data were arrived at.
Method: Random assignment

- Critical in experimental design.
- Do not think you are random: humans are terrible at it. Let software decide the assignment.
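Letting software decide can be this simple. A minimal sketch with numpy, using made-up participant IDs; the fixed seed is an illustrative choice so the assignment is reproducible and auditable:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # record the seed so the assignment can be audited

participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 hypothetical participant IDs

# Shuffle once, then split into two equal-sized groups.
shuffled = rng.permutation(participants)
treatment, control = shuffled[:10], shuffled[10:]

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

The point is that no human judgment enters the assignment at any step; the seed plus the code fully determine who lands where.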
- In non-experimental designs, 'comparison' groups may be implemented, but they are not true controls and should not be implied to be such.
- Control can be introduced via design and via analysis.
- Random assignment and control do not by themselves provide causality. Causal claims are subjective ones, made on the basis of evidence, control of confounds, contiguity, common sense, etc.
Measurement: Variables

- Precision in naming is a must: variable names should reflect the operational definitions of constructs. For example: "intelligence," no; "IQ test score," yes.
- Nothing about how a value is derived should be left open to question; ranges and calculations must be made extremely clear.

Measurement: Instruments

- Reliability standards in psychology are low, and somehow getting worse.
- The easiest way to ruin a study and waste a lot of time is to use a poor measure; it only takes one to muck up everything.
- You are much better off assuming that a previously used instrument was a bad idea than assuming it is fine because someone else used it before.
- Even when using a well-known instrument, report the reliability for your own study whenever possible. This not only informs readers about which populations a measure may or may not be reliable for, it is crucial for meta-analysis.
- Recall that there is no single 'reliability' for an instrument; there are reliability estimates for that instrument in various populations.
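Reporting reliability for your own sample is cheap to do. A sketch of one common internal-consistency estimate, Cronbach's alpha, computed on simulated scale data (the four-item scale and the noise level are assumptions for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated scale: 4 items that all tap one underlying trait, plus noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=200)
scores = np.column_stack([trait + rng.normal(scale=0.4, size=200) for _ in range(4)])

print(f"alpha = {cronbach_alpha(scores):.3f}")
```

Run this on your own sample's item scores and the estimate is specific to your population, which is exactly the point of the bullet above.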
Measurement: Procedure

- Methods of collection must be sound, and every aspect of them must be communicated so that others can be sure of the lack of bias.
- 'Missing' data can be accounted for in a variety of ways in this day and age, and the worst way to handle it is to ignore incomplete cases entirely, which can introduce extreme bias into a study.

Power and sample size

- Don't be lazy; get a big sample.
- It is very easy to calculate the sample size needed for typical analyses. However, there are many problems with such estimates, both theoretical and practical, as we will discuss later.
- The main thing is that it should be clear how the present sample size was determined.
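"Very easy" is not an exaggeration. A sketch of the standard per-group sample size for a two-sample t-test, using the normal approximation (which slightly underestimates relative to the exact t-based calculation) and only the Python standard library:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided two-sample t-test,
    given a standardized effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):  # Cohen's conventional small / medium / large
    print(f"d = {d}: n = {n_per_group(d)} per group")
```

The theoretical problem hinted at above is that the answer is only as good as the guessed d; the practical one is that people rarely report how they arrived at either.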
Results: Complications

- Obviously, any problems that arise should be made known, and a thorough initial examination of the data makes that easy: search for outliers and miskeys, test statistical assumptions, identify missing data.
- Inspecting your data is not fishing, snooping, or whatever; it is required for doing minimally adequate research.
- Visual methods are best and really highlight issues easily.
- From the article: "if you assess hypotheses without examining your data, you risk publishing nonsense." "If you assess hypotheses without examining your data, you will publish nonsense." Fixed.
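A minimal sketch of that initial examination on simulated data (the planted miskey and missing cases are assumptions for illustration); robust median/MAD screening is one reasonable choice, since the mean and SD are distorted by the very outliers you are hunting:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=100)
x[7] = 500.0         # a plausible miskey (decimal slip)
x[[3, 40]] = np.nan  # two incomplete cases

# Missing data: count it before doing anything else.
n_missing = int(np.isnan(x).sum())

# Outliers/miskeys: flag values more than 3 robust SDs from the median.
clean = x[~np.isnan(x)]
med = np.median(clean)
mad = np.median(np.abs(clean - med)) * 1.4826  # MAD rescaled to SD units
flags = clean[np.abs(clean - med) / mad > 3]

print(f"missing: {n_missing}, flagged values: {flags}")
```

None of this is hypothesis testing; it is exactly the kind of screening the slide says is required before any analysis can be trusted.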
Results: Analysis

- Your analysis is determined before data collection, not after. If you do not know what analysis to run and you have already collected the data, you just wasted a lot of time.
- The order is: theory → research hypotheses → analysis 'family' → appropriate measures for those analyses → data collection.
- The only exception is archival data, and if you are doing that, you have a whole host of other problems to deal with.
- "Do not choose an analytic method to impress your readers or to deflect criticism." Unfortunately, it seems common in psychology for researchers to choose the analysis before the research question, mostly for the former reason (at which point they do it poorly and have the opposite effect on those who do know the analysis).
- While "the simpler classical approaches" are fine, I do not agree that they should have special status, if for no other reason than that neither data nor sufficiently considered research questions conform to their use except on rare occasion. Furthermore, we also have the tools to do much better and equally understandable analyses, and calling an analysis 'complex' is often more a statement about familiarity than about difficulty.
Results: Statistical computing

Regarding programs specifically:
- "There are many good computer programs for analyzing data."
- "If a computer program does not provide the analysis you need, use another program rather than let the computer shape your thinking."

Regarding not letting the program do your thinking for you:
- "Do not report statistics found on a printout without understanding how they are computed or what they mean."
- "There is no substitute for common sense."

Is it just me, or are these very clear and easily understood statements? Would you believe I've actually had to defend them?
Results: Assumptions

- "You should take efforts to assure that the underlying assumptions required for the analysis are reasonable given the data."
- Despite this, it is often difficult to find any mention of checking assumptions, or of appropriate and modern ways of dealing with the problem of not meeting them.

Results: Hypothesis testing

- "Never use the unfortunate expression 'accept the null hypothesis.'"
- Outcomes are fuzzy; that's OK.
Results: Effect sizes

- "Always present effect sizes for primary outcomes." "Always present effect sizes." Fixed.
- Small effects may still have practical importance, or that finding may matter more to others than it does to you.

Results: Confidence intervals

- Reporting the uncertainty of an estimate is important. Do it. And do it for the effect sizes.
- "Interval estimates should be given for any effect sizes involving principal outcomes."
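Both recommendations in one sketch: Cohen's d for two independent groups with an approximate 95% interval. The groups are simulated, and the standard error is the usual large-sample approximation, not an exact noncentral-t interval:

```python
import numpy as np

def cohens_d_ci(a, b, z=1.96):
    """Cohen's d for two independent groups, with an approximate 95% CI
    based on the large-sample standard error of d."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled standard deviation
    sp = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / sp
    # Approximate standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

rng = np.random.default_rng(2)
treat = rng.normal(loc=0.5, scale=1.0, size=50)  # simulated: true effect d = 0.5
ctrl = rng.normal(loc=0.0, scale=1.0, size=50)

d, low, high = cohens_d_ci(treat, ctrl)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The width of that interval with n = 50 per group is itself instructive: point estimates of effect size are much noisier than people tend to assume.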
Results: Multiple comparisons/tests

- First, pairwise methods... were designed to control a familywise error rate based on the sample size and number of comparisons. Preceding them with an omnibus F test in a stagewise testing procedure defeats this design, making it unnecessarily conservative.
- Second, researchers rarely need to compare all possible means to understand their results or assess their theory; by setting their sights large, they sacrifice their power to see small.
- Third, the lattice of all possible pairs is a straitjacket; forcing themselves to wear it often restricts researchers to uninteresting hypotheses and induces them to ignore more fruitful ones.
- Again, a fairly straightforward recommendation: do not 'lay waste with t-tests'.
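When a handful of planned comparisons really is needed, familywise error can be controlled directly, with no omnibus pre-test. A sketch of the Holm step-down adjustment (the input p-values are made up for illustration):

```python
def holm_adjust(pvals):
    """Holm step-down adjustment of a list of p-values.
    Controls the familywise error rate without an omnibus pre-test
    and is uniformly more powerful than plain Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still "in play",
        # enforcing monotonicity of the adjusted values.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

raw = [0.001, 0.01, 0.02, 0.20]  # four planned comparisons
print(holm_adjust(raw))
```

Note that the procedure only pays a penalty proportional to the number of comparisons actually made, which is exactly why testing only the few theoretically interesting contrasts preserves power.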
Results

- "There is a variant of this preoccupation with all possible pairs that comes with the widespread practice of printing p values or asterisks next to every correlation in a correlation matrix... One should ask instead why any reader would want this information."
- People do not need an asterisk to tell them whether a correlation is strong or not; the correlation is an effect size and should be treated accordingly.
- Humans are good pattern recognizers: if there is a trend, readers will likely spot it on their own, or you can make it more apparent with summary statements that highlight such patterns.
- Putting asterisks all over the place does not imply anything more than that you are going to prop up poor results with statistical significance, or worse, that some 'fishing' went on.
Results: Causal claims

- Establishing causality is tricky business, especially since it cannot, technically, be done. There is no causality statistic, and neither causal modeling nor experimentation establishes it in and of itself.
- However, we do assume causal relations based on evidence and careful consideration of the problem itself; just be prepared for a difficult undertaking in attempting to establish them.
Results: Tables and figures

- People simply do not take enough time, or put enough thought, into how their results are displayed. Like anything else, you need to be able to hold your audience's attention.
- People spend a lot of time going back over tables and figures, more than they spend rereading the text.
- It is very easy to display a lot of pertinent information in a fairly simple graph, and this is the goal: maximum information, minimum clutter.
- Furthermore, what can be displayed graphically in a meaningful way is not restricted: any number of graphs you have never come across may be the best choice. This is where you can really be creative, so allow yourself to be!
- Unfortunately, many limit themselves to the limitations of their statistical program, and in trying to spruce up bad graphics they end up making interpretation worse (e.g., the 3-D bar chart). Stats programs are generally behind dedicated graphics programs in their offerings (obviously), and some are so archaic as to make customizing even simple graphs a labor-intensive enterprise.
Discussion: Interpretation

- Credibility, generalizability, and robustness.

Discussion: Conclusions

- Conclusions do not reside in a vacuum; they must be placed within the context of prior and ongoing relevant studies.
- Do not overgeneralize. In the grand scheme of things, one study is rarely worth much, and no study has value without replication/validation.
- Thoughtfully make recommendations on issues to be addressed by future research, and on how researchers might address them. "Further research must be done..." was already known before you started coming up with theories to test. You might as well say "Future research should be printed in black ink."; it would be about as useful.
The real problem

The initial approach laid out:
- Fisher, R. A. (1925). Statistical Methods for Research Workers.
- Fisher, R. A. (1935). The Design of Experiments.
- Neyman, J. (1937). "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability." Philosophical Transactions of the Royal Society of London, Series A.

Immediate criticism:
- Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association.
- Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association.

Later criticism:
- Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Recent criticism:
- Harlow, Mulaik, & Steiger (1997). What If There Were No Significance Tests?

Problems with power:
- Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.

On the utility of exploration:
- Tukey, J. W. (1977). Exploratory Data Analysis.

Emphasis on use of relevant graphics:
- Tufte, E. R. (1983). The Visual Display of Quantitative Information.

Effect sizes

Correlation coefficient:
- Pearson, K. (1896). Regression, heredity and panmixia. Philosophical Transactions A.
- Peirce, C. S. (1884). The numerical measure of the success of predictions. Science.

Standardized mean difference:
- Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.

Issues regarding causality:
- Aristotle, Physics II.3.
- Hume, D. (1739). A Treatise of Human Nature.
- Related methods: SEM, propensity score matching.

Some 'modern' methods

Bootstrapping:
- Efron, B. (1979). "Bootstrap methods: Another look at the jackknife." The Annals of Statistics, 7(1).

Robust methods:
- Huber, P. J. (1981). Robust Statistics.

Bayesian:
- Bayes, T. (1764). An Essay Towards Solving a Problem in the Doctrine of Chances.
- Robbins, H. (1956). An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics.

Structural equation modeling:
- Wright, S. (1921). "Correlation and causation." Journal of Agricultural Research, 20.
The real problem

- The real issue is that most of these problems have existed since the beginning of statistical science, have been noted since the beginning, and have had many solutions offered for decades, and yet much of psychological research exists apparently oblivious to this. Or are researchers simply ignoring them?

Task Force on Statistical Inference

- Initial meetings and recommendations: 1996
- Official paper: 1999
- Follow-up study: 2006, Cumming et al., "Statistical Reform in Psychology: Is Anything Changing?"
- Change, but little reform yet: "At least in these 10 journals, NHST continues to dominate overwhelmingly. CI reporting is increasing but still low, and CIs are seldom used for interpretation. Figures with error bars are now common, but bars are usually SEs, not the recommended CIs..."
- If we can't expect the 'top' journals to change in a reasonable amount of time, what are we to make of our science?