TASK FORCE SUMMARY
Posted 11-Feb-2016
Method: Design

- Don't 'pretend' the study is something it's not.
- Hypothesis generating vs. hypothesis testing, or exploratory vs. confirmatory: both can be of great value, and they are not mutually exclusive, even within a single study.
- Populations can be anything; make sure it is clear which population you are trying to speak to.

Method: Sampling

- Sampling can actually be a quite complex undertaking; make sure it is clear how the data were arrived at.
Method: Random assignment

- Critical in experimental design.
- Do not think you are random: humans are terrible at it. Let software decide the assignment.
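Letting software decide can be this simple. A minimal sketch with numpy, using made-up participant IDs; the fixed seed is an illustrative choice so the assignment is reproducible and auditable:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # record the seed so the assignment can be audited

participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 hypothetical participant IDs

# Shuffle once, then split into two equal-sized groups.
shuffled = rng.permutation(participants)
treatment, control = shuffled[:10], shuffled[10:]

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

The point is that no human judgment enters the assignment at any step; the seed plus the code fully determine who lands where.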
- In non-experimental designs, 'comparison' groups may be implemented, but they are not true controls and should not be implied to be such.
- Control can be introduced via design and via analysis.
- Random assignment and control do not by themselves provide causality. Causal claims are subjective ones, made on the basis of evidence, control of confounds, contiguity, common sense, etc.
Measurement: Variables

- Precision in naming is a must: variable names should reflect the operational definitions of constructs. For example: "intelligence," no; "IQ test score," yes.
- Nothing about how a value is derived should be left open to question; ranges and calculations must be made extremely clear.

Measurement: Instruments

- Reliability standards in psychology are low, and somehow getting worse.
- The easiest way to ruin a study and waste a lot of time is to use a poor measure; it only takes one to muck up everything.
- You are much better off assuming that a previously used instrument was a bad idea than assuming it is fine because someone else used it before.
- Even when using a well-known instrument, report the reliability for your own study whenever possible. This not only informs readers about which populations a measure may or may not be reliable for, it is crucial for meta-analysis.
- Recall that there is no single 'reliability' for an instrument; there are reliability estimates for that instrument in various populations.
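Reporting reliability for your own sample is cheap to do. A sketch of one common internal-consistency estimate, Cronbach's alpha, computed on simulated scale data (the four-item scale and the noise level are assumptions for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated scale: 4 items that all tap one underlying trait, plus noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=200)
scores = np.column_stack([trait + rng.normal(scale=0.4, size=200) for _ in range(4)])

print(f"alpha = {cronbach_alpha(scores):.3f}")
```

Run this on your own sample's item scores and the estimate is specific to your population, which is exactly the point of the bullet above.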
Measurement: Procedure

- Methods of collection must be sound, and every aspect of them must be communicated so that others can be sure of the lack of bias.
- 'Missing' data can be accounted for in a variety of ways in this day and age, and the worst way to handle it is to ignore incomplete cases entirely, which can introduce extreme bias into a study.

Power and sample size

- Don't be lazy; get a big sample.
- It is very easy to calculate the sample size needed for typical analyses. However, there are many problems with such estimates, both theoretical and practical, as we will discuss later.
- The main thing is that it should be clear how the present sample size was determined.
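"Very easy" is not an exaggeration. A sketch of the standard per-group sample size for a two-sample t-test, using the normal approximation (which slightly underestimates relative to the exact t-based calculation) and only the Python standard library:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided two-sample t-test,
    given a standardized effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):  # Cohen's conventional small / medium / large
    print(f"d = {d}: n = {n_per_group(d)} per group")
```

The theoretical problem hinted at above is that the answer is only as good as the guessed d; the practical one is that people rarely report how they arrived at either.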
Results: Complications

- Obviously, any problems that arise should be made known, and a thorough initial examination of the data makes that easy: search for outliers and miskeys, test statistical assumptions, identify missing data.
- Inspecting your data is not fishing, snooping, or whatever; it is required for doing minimally adequate research.
- Visual methods are best and really highlight issues easily.
- From the article: "if you assess hypotheses without examining your data, you risk publishing nonsense." "If you assess hypotheses without examining your data, you will publish nonsense." Fixed.
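A minimal sketch of that initial examination on simulated data (the planted miskey and missing cases are assumptions for illustration); robust median/MAD screening is one reasonable choice, since the mean and SD are distorted by the very outliers you are hunting:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=100)
x[7] = 500.0         # a plausible miskey (decimal slip)
x[[3, 40]] = np.nan  # two incomplete cases

# Missing data: count it before doing anything else.
n_missing = int(np.isnan(x).sum())

# Outliers/miskeys: flag values more than 3 robust SDs from the median.
clean = x[~np.isnan(x)]
med = np.median(clean)
mad = np.median(np.abs(clean - med)) * 1.4826  # MAD rescaled to SD units
flags = clean[np.abs(clean - med) / mad > 3]

print(f"missing: {n_missing}, flagged values: {flags}")
```

None of this is hypothesis testing; it is exactly the kind of screening the slide says is required before any analysis can be trusted.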
Results: Analysis

- Your analysis is determined before data collection, not after. If you do not know what analysis to run and you have already collected the data, you just wasted a lot of time.
- The order is: theory → research hypotheses → analysis 'family' → appropriate measures for those analyses → data collection.
- The only exception is archival data, and if you are doing that, you have a whole host of other problems to deal with.
- "Do not choose an analytic method to impress your readers or to deflect criticism." Unfortunately, it seems common in psychology for researchers to choose the analysis before the research question, mostly for the former reason (at which point they do it poorly and have the opposite effect on those who do know the analysis).
- While "the simpler classical approaches" are fine, I do not agree that they should have special status, if for no other reason than that neither data nor sufficiently considered research questions conform to their use except on rare occasion. Furthermore, we also have the tools to do much better and equally understandable analyses, and calling an analysis 'complex' is often more a statement about familiarity than about difficulty.
Results: Statistical computing

Regarding programs specifically:
- "There are many good computer programs for analyzing data."
- "If a computer program does not provide the analysis you need, use another program rather than let the computer shape your thinking."

Regarding not letting the program do your thinking for you:
- "Do not report statistics found on a printout without understanding how they are computed or what they mean."
- "There is no substitute for common sense."

Is it just me, or are these very clear and easily understood statements? Would you believe I've actually had to defend them?
Results: Assumptions

- "You should take efforts to assure that the underlying assumptions required for the analysis are reasonable given the data."
- Despite this, it is often difficult to find any mention of checking assumptions, or of appropriate and modern ways of dealing with the problem of not meeting them.

Results: Hypothesis testing

- "Never use the unfortunate expression 'accept the null hypothesis.'"
- Outcomes are fuzzy; that's OK.
Results: Effect sizes

- "Always present effect sizes for primary outcomes." "Always present effect sizes." Fixed.
- Small effects may still have practical importance, or that finding may matter more to others than it does to you.

Results: Confidence intervals

- Reporting the uncertainty of an estimate is important. Do it. And do it for the effect sizes.
- "Interval estimates should be given for any effect sizes involving principal outcomes."
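Both recommendations in one sketch: Cohen's d for two independent groups with an approximate 95% interval. The groups are simulated, and the standard error is the usual large-sample approximation, not an exact noncentral-t interval:

```python
import numpy as np

def cohens_d_ci(a, b, z=1.96):
    """Cohen's d for two independent groups, with an approximate 95% CI
    based on the large-sample standard error of d."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled standard deviation
    sp = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / sp
    # Approximate standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

rng = np.random.default_rng(2)
treat = rng.normal(loc=0.5, scale=1.0, size=50)  # simulated: true effect d = 0.5
ctrl = rng.normal(loc=0.0, scale=1.0, size=50)

d, low, high = cohens_d_ci(treat, ctrl)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The width of that interval with n = 50 per group is itself instructive: point estimates of effect size are much noisier than people tend to assume.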
Results: Multiple comparisons/tests

- First, pairwise methods... were designed to control a familywise error rate based on the sample size and number of comparisons. Preceding them with an omnibus F test in a stagewise testing procedure defeats this design, making it unnecessarily conservative.
- Second, researchers rarely need to compare all possible means to understand their results or assess their theory; by setting their sights large, they sacrifice their power to see small.
- Third, the lattice of all possible pairs is a straitjacket; forcing themselves to wear it often restricts researchers to uninteresting hypotheses and induces them to ignore more fruitful ones.
- Again, a fairly straightforward recommendation: do not 'lay waste with t-tests'.
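When a handful of planned comparisons really is needed, familywise error can be controlled directly, with no omnibus pre-test. A sketch of the Holm step-down adjustment (the input p-values are made up for illustration):

```python
def holm_adjust(pvals):
    """Holm step-down adjustment of a list of p-values.
    Controls the familywise error rate without an omnibus pre-test
    and is uniformly more powerful than plain Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still "in play",
        # enforcing monotonicity of the adjusted values.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

raw = [0.001, 0.01, 0.02, 0.20]  # four planned comparisons
print(holm_adjust(raw))
```

Note that the procedure only pays a penalty proportional to the number of comparisons actually made, which is exactly why testing only the few theoretically interesting contrasts preserves power.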
Results

- "There is a variant of this preoccupation with all possible pairs that comes with the widespread practice of printing p values or asterisks next to every correlation in a correlation matrix... One should ask instead why any reader would want this information."
- People do not need an asterisk to tell them whether a correlation is strong or not; the correlation is an effect size and should be treated accordingly.
- Humans are good pattern recognizers: if there is a trend, readers will likely spot it on their own, or you can make it more apparent with summary statements that highlight such patterns.
- Putting asterisks all over the place does not imply anything more than that you are going to prop up poor results with statistical significance, or worse, that some 'fishing' went on.
Results: Causal claims

- Establishing causality is tricky business, especially since it cannot, technically, be done. There is no causality statistic, and neither causal modeling nor experimentation establishes it in and of itself.
- However, we do assume causal relations based on evidence and careful consideration of the problem itself; just be prepared for a difficult undertaking in attempting to establish them.
Results: Tables and figures

- People simply do not take enough time, or put enough thought, into how their results are displayed. Like anything else, you need to be able to hold your audience's attention.
- People spend a lot of time going back over tables and figures, more than they spend rereading the text.
- It is very easy to display a lot of pertinent information in a fairly simple graph, and this is the goal: maximum information, minimum clutter.
- Furthermore, what can be displayed graphically in a meaningful way is not restricted: any number of graphs you have never come across may be the best choice. This is where you can really be creative, so allow yourself to be!
- Unfortunately, many limit themselves to the limitations of their statistical program, and in trying to spruce up bad graphics they end up making interpretation worse (e.g., the 3-D bar chart). Stats programs are generally behind dedicated graphics programs in their offerings (obviously), and some are so archaic as to make customizing even simple graphs a labor-intensive enterprise.
Discussion: Interpretation

- Credibility, generalizability, and robustness.

Discussion: Conclusions

- Conclusions do not reside in a vacuum; they must be placed within the context of prior and ongoing relevant studies.
- Do not overgeneralize. In the grand scheme of things, one study is rarely worth much, and no study has value without replication/validation.
- Thoughtfully make recommendations on issues to be addressed by future research, and on how researchers might address them. "Further research must be done..." was already known before you started coming up with theories to test. You might as well say "Future research should be printed in black ink."; it would be about as useful.
The real problem

The initial approach laid out:
- Fisher, R. A. (1925). Statistical Methods for Research Workers.
- Fisher, R. A. (1935). The Design of Experiments.
- Neyman, J. (1937). "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability." Philosophical Transactions of the Royal Society of London, Series A.

Immediate criticism:
- Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association.
- Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association.

Later criticism:
- Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Recent criticism:
- Harlow, Mulaik, & Steiger (1997). What If There Were No Significance Tests?

Problems with power:
- Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.

On the utility of exploration:
- Tukey, J. W. (1977). Exploratory Data Analysis.

Emphasis on use of relevant graphics:
- Tufte, E. R. (1983). The Visual Display of Quantitative Information.

Effect sizes

Correlation coefficient:
- Pearson, K. (1896). Regression, heredity and panmixia. Philosophical Transactions A.
- Peirce, C. S. (1884). The numerical measure of the success of predictions. Science.

Standardized mean difference:
- Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.

Issues regarding causality:
- Aristotle, Physics II.3.
- Hume, D. (1739). A Treatise of Human Nature.
- Related methods: SEM, propensity score matching.

Some 'modern' methods

Bootstrapping:
- Efron, B. (1979). "Bootstrap methods: Another look at the jackknife." The Annals of Statistics, 7(1).

Robust methods:
- Huber, P. J. (1981). Robust Statistics.

Bayesian:
- Bayes, T. (1764). An Essay Towards Solving a Problem in the Doctrine of Chances.
- Robbins, H. (1956). An empirical Bayes approach to statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics.

Structural equation modeling:
- Wright, S. (1921). "Correlation and causation." Journal of Agricultural Research, 20.
The real problem

- The real issue is that most of these problems have existed since the beginning of statistical science, have been noted since the beginning, and have had many solutions offered for decades, and yet much of psychological research exists apparently oblivious to this. Or are researchers simply ignoring them?

Task Force on Statistical Inference

- Initial meetings and recommendations: 1996
- Official paper: 1999
- Follow-up study: 2006, Cumming et al., "Statistical Reform in Psychology: Is Anything Changing?"
- Change, but little reform yet: "At least in these 10 journals, NHST continues to dominate overwhelmingly. CI reporting is increasing but still low, and CIs are seldom used for interpretation. Figures with error bars are now common, but bars are usually SEs, not the recommended CIs..."
- If we can't expect the 'top' journals to change in a reasonable amount of time, what are we to make of our science?