Understanding scientific research [2]: Types and quality of evidence
In conducting scientific research, it’s imperative to know how to evaluate the scientific literature and to recognize the differing nature and strength of the available evidence. Evidence is there to help us figure out what is true, not to defend what we wish to be true.
Scientific beliefs and conclusions must be based on solid evidence, but there are several kinds of evidence, each with its own strengths and weaknesses, so it is neither easy nor obvious how evidence should be interpreted. Methods of balancing and comparing the different kinds of evidence are required.
The first distinction in scientific evidence is between experimental and observational evidence. The first step is to assess an individual study, which was already explored in part 1.
Then, all available evidence has to be balanced and compared.
1. Experimental versus Observational Studies
Experimental studies control as many variables as possible to measure a specific outcome. The goal is to isolate one variable so that its specific effects can be determined.
The strengths of experimental studies include:
1. Controlling and isolating variables.
2. Quantitative measurement of some specific feature or outcome.
3. Statistical comparison, because there are comparison groups.
Weaknesses are also present:
1. Artefacts.
2. Interfering with a system may change its behavior.
3. May not be representative of real-world experiences.
4. May not be practical. Certain kinds of experimental studies simply cannot be performed, for example those that would expose a person or group to toxins or to anything else that puts their well-being or life at risk.
Observational studies ideally do not intervene; they observe the world as it is. They are useful for finding correlations, for example between a risk factor and a known disease. They are also used in palaeontology and archaeology, which discover and examine fossils, and in astronomy, which observes the light from stars.
Strengths of observational studies:
1. Large amounts of data by observing what already exists.
2. Group comparisons.
3. Minimal intervention in the natural behavior of the system.
Weaknesses:
1. Do not control many variables.
2. Always subject to unknown variables.
3. Demonstrate correlation but cannot definitively establish cause and effect.
These two types of evidence, experimental and observational, are complementary and work together to provide different kinds of information with different strengths and weaknesses.
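The difference between the two kinds of evidence can be sketched with a small simulation. It assumes a made-up hidden variable that drives both an exposure and an outcome, while the exposure itself has no effect at all. The variable names and numbers are illustrative assumptions, not data from any real study.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Observational setting: a hidden variable drives both exposure and outcome.
hidden = [rng.gauss(0, 1) for _ in range(n)]
exposure_obs = [h + rng.gauss(0, 1) for h in hidden]
outcome_obs = [h + rng.gauss(0, 1) for h in hidden]  # exposure has no effect

# Experimental setting: the researcher assigns the exposure at random,
# so it can no longer track the hidden variable.
exposure_exp = [rng.gauss(0, 1) for _ in range(n)]
outcome_exp = [h + rng.gauss(0, 1) for h in hidden]  # still no effect

r_observational = pearson(exposure_obs, outcome_obs)
r_experimental = pearson(exposure_exp, outcome_exp)
print(f"observational r = {r_observational:.2f}")
print(f"experimental  r = {r_experimental:.2f}")
```

Under these assumptions the observational data show a clear correlation even though nothing causal is going on, while the randomized version shows a correlation close to zero.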
2. Examining the Data
Studies with only a few subjects (observational or experimental) are more likely to be erroneous, because of their lower signal-to-noise ratio. Random effects tend to average out in large studies with large sample sizes.
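A minimal sketch of why small studies are noisier: it simulates many studies of a phenomenon whose true value is zero and measures how widely the study averages scatter. The sample sizes (10 and 1000) and the normally distributed outcome are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def spread_of_estimates(sample_size, n_studies=500):
    """Standard deviation of the sample means across many simulated studies."""
    means = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

spread_small = spread_of_estimates(10)
spread_large = spread_of_estimates(1000)
print(f"scatter of results with n=10:   {spread_small:.3f}")
print(f"scatter of results with n=1000: {spread_large:.3f}")
```

With 100 times more subjects, the scatter of results should be roughly 10 times smaller, since random noise shrinks with the square root of the sample size.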
The statistical significance of results is often expressed as a p-value: the probability of getting results at least as extreme as those observed if the null hypothesis is true. The null hypothesis is the hypothesis that the phenomenon studied does not exist; the evidence is meant to establish that it does.
A significance threshold of p=0.05 means that, among studies where the null hypothesis is in fact true, about 1 in 20 will still give a positive result, that is, a false positive.
Barely significant results are not as compelling as highly significant results, because from chance alone we can expect the literature to contain many false positive studies.
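The 1-in-20 figure can be checked with a rough simulation. The sketch below assumes a simple two-sided z-test on 30 subjects per study with a known standard deviation of 1; these choices are illustrative, and real studies use whatever test fits their design.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(2)  # fixed seed so the sketch is repeatable
standard_normal = NormalDist()

def two_sided_p_value(sample):
    """z-test p-value for 'the true mean is 0', assuming a known SD of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - standard_normal.cdf(abs(z)))

n_studies = 4000
false_positives = 0
for _ in range(n_studies):
    # The null hypothesis is true here: the real effect is exactly zero.
    sample = [rng.gauss(0.0, 1.0) for _ in range(30)]
    if two_sided_p_value(sample) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_studies
print(f"share of 'significant' studies with no real effect: {false_positive_rate:.3f}")
```

The share of falsely positive studies should land close to 0.05, which is about 1 in 20.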
However, statistical significance is not everything. A systematic flaw or bias in a study design can systematically bias the results in one direction (1).
The effect size is also needed: we need to consider how large the effect is. The smaller the effect size, the greater the probability that some subtle bias influenced the outcome. Tiny effect sizes are therefore always tricky, especially those right at the limit of our ability to detect them.
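One common way to put a number on effect size is Cohen's d, the difference between group means divided by the pooled standard deviation. The text above does not name a specific measure, so this choice is an assumption, and the scores below are made up for illustration.

```python
import statistics

def cohens_d(treated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    var1, var2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd

# Made-up outcome scores, for illustration only.
treated = [5.1, 6.0, 5.6, 6.3, 5.8, 6.1]
control = [5.0, 5.4, 5.2, 5.9, 5.5, 5.3]
d = cohens_d(treated, control)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in units of standard deviation, it can be compared across studies that measured outcomes on different scales.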
We also need to consider the drop-out rate of a study. A drop-out rate of 10-20% seriously reduces the reliability of the outcome. All data must be counted, because counting only a subset of the data can create false results. Selecting data can systematically bias results and make the outcome misleading.
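A rough sketch of how drop-outs can create a false result: the treatment below has no effect at all, but the analysis that counts only the subjects who stayed in the study finds one anyway. The assumption that the worst-faring 15% of treated subjects drop out is hypothetical and deliberately simple.

```python
import random
from statistics import fmean

rng = random.Random(3)  # fixed seed so the sketch is repeatable
n = 2000

# The treatment has no real effect: both groups share one outcome distribution.
control = [rng.gauss(0.0, 1.0) for _ in range(n)]
treated = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Assumption for illustration: the 15% of treated subjects who fare worst drop out.
dropout_rate = 0.15
completers = sorted(treated)[int(dropout_rate * n):]

diff_all_data = fmean(treated) - fmean(control)
diff_completers_only = fmean(completers) - fmean(control)
print(f"difference counting all data:        {diff_all_data:+.2f}")
print(f"difference counting completers only: {diff_completers_only:+.2f}")
```

Counting everyone gives a difference near zero, while counting only the completers makes the ineffective treatment look beneficial.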
Every single study must be systematically reviewed and then published; this is another way of counting all the data, across all of the studies.
Statisticians use what are called funnel plots, which show the scatter of results from different studies on the same question.
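The data behind a funnel plot can be sketched without any plotting library. The simulation below assumes every study measures the same hypothetical true effect of 0.2 and differs only in sample size; the sizes and the effect are illustrative assumptions.

```python
import random
from statistics import fmean, stdev

rng = random.Random(4)  # fixed seed so the sketch is repeatable
TRUE_EFFECT = 0.2  # hypothetical true effect shared by every simulated study

def simulate_study(sample_size):
    """Return (estimated effect, standard error) for one simulated study."""
    sample = [rng.gauss(TRUE_EFFECT, 1.0) for _ in range(sample_size)]
    return fmean(sample), stdev(sample) / sample_size ** 0.5

sizes = [20] * 40 + [80] * 40 + [320] * 40
studies = [(n, *simulate_study(n)) for n in sizes]

# A funnel plot puts each study's estimate on one axis and its precision
# (or standard error) on the other; here the scatter is summarized per size.
scatter = {}
for n in (20, 80, 320):
    estimates = [est for size, est, _se in studies if size == n]
    scatter[n] = stdev(estimates)
    print(f"n={n:4d}  mean estimate={fmean(estimates):+.2f}  scatter={scatter[n]:.2f}")
```

When nothing is missing from the literature, small studies scatter widely and symmetrically around the true effect and large studies cluster tightly, which gives the plot its funnel shape. A lopsided funnel is a hint that some results were never published.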
The literature is full of preliminary studies, where there’s a large bias toward being falsely positive. However, as more rigorous studies are conducted, they begin to show the real effect.
As better studies are conducted over time, effect sizes tend to shrink; this is called the decline effect. Even when the effect is real, the effect size is usually small in the more rigorous studies. For non-existent phenomena, the effect size shrinks to zero. The less rigorous studies tend to be more variable in their results and more shifted or biased towards the positive, but that gets worked out as studies become more rigorous.
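One possible contributor to the decline effect can be sketched as follows: if only positive, statistically significant results get noticed, small early studies will overstate a small real effect, and larger later studies will not. The true effect of 0.1, the sample sizes, and the shortcut of drawing each study's estimate directly from a normal distribution are all assumptions made for illustration.

```python
import random
from statistics import fmean

rng = random.Random(5)  # fixed seed so the sketch is repeatable
TRUE_EFFECT = 0.1  # hypothetical small but real effect

def mean_significant_estimate(sample_size, n_studies=20000):
    """Average estimate among studies that reach z > 1.96 (positive and significant)."""
    standard_error = 1.0 / sample_size ** 0.5  # outcome SD assumed to be 1
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(TRUE_EFFECT, standard_error)
        if estimate / standard_error > 1.96:
            significant.append(estimate)
    return fmean(significant)

early_small_studies = mean_significant_estimate(25)
later_large_studies = mean_significant_estimate(2500)
print(f"average significant estimate, n=25:   {early_small_studies:.2f}")
print(f"average significant estimate, n=2500: {later_large_studies:.2f}")
```

Under these assumptions the small studies that reach significance report an effect several times larger than the true one, while the large studies report something close to 0.1.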
3. Prospective and retrospective
A prospective study observes behavior and outcomes as they unfold; conversely, a retrospective study looks back at events and outcomes that have already occurred.
In principle, prospective studies are more rigorous because they have fewer confounding factors, are more systematic, and use more representative samples. Retrospective studies, however, are open to multiple biases.
4. Blinded and double blinded
In a rigorous blinded study, the scientists assessing the outcome do not know whether what they are looking at belongs to the intervention group or to the control group. In a double-blinded study, neither the subjects nor the researchers know which group each subject is in. This reduces subconscious researcher bias in the results. Experimental studies generally should be blinded in order to be reliable. Observational studies, on the other hand, can only be partially blinded.
The bias of the researcher can influence the results; in other words, the results tend to be what the researcher expects. Such erroneous results tend to disappear when proper blinding is put into place.
5. Controls
The controls must also be adequate. For example, what is the subject of the study being compared to? In medical trials an active control may obscure the comparison to the treatment, or the placebo may cause a negative outcome, making the experimental treatment seem artificially better. Or the standard treatment to which a new treatment is being compared may itself be ineffective, making the new treatment seem more effective than it really is.
The control group, like the study sample as a whole, must also be representative of the population of interest. For example, the 1936 Literary Digest poll failed mostly due to an unrepresentative sample and a failure “to include the supposed core of Roosevelt’s support, the poor… and were excluded from the pool” (1). However, others also pointed to “both the sample and the response rate as being flawed”, and asserted that “the initial bias towards over-representation of Republicans in the sample was exacerbated by the fact that better-educated and wealthy people who tended to be Landon supporters were more likely to respond to the survey.” (1).
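The lesson of the poll, that a huge unrepresentative sample can be worse than a small representative one, can be sketched with a toy electorate. All of the percentages below are hypothetical and are not the real 1936 figures.

```python
import random

rng = random.Random(6)  # fixed seed so the sketch is repeatable

# Hypothetical electorate, not the real 1936 figures.
SHARE_WEALTHY = 0.25
SUPPORT_AMONG_WEALTHY = 0.36  # share backing candidate A
SUPPORT_AMONG_OTHERS = 0.68
TRUE_SUPPORT = (SHARE_WEALTHY * SUPPORT_AMONG_WEALTHY
                + (1 - SHARE_WEALTHY) * SUPPORT_AMONG_OTHERS)

def poll(sample_size, share_wealthy_in_sample):
    """Estimated support for candidate A in one simulated poll."""
    votes_for_a = 0
    for _ in range(sample_size):
        wealthy = rng.random() < share_wealthy_in_sample
        support = SUPPORT_AMONG_WEALTHY if wealthy else SUPPORT_AMONG_OTHERS
        if rng.random() < support:
            votes_for_a += 1
    return votes_for_a / sample_size

# A very large poll that over-samples wealthy voters versus a small random one.
huge_biased_poll = poll(200_000, share_wealthy_in_sample=0.80)
small_random_poll = poll(2_000, share_wealthy_in_sample=SHARE_WEALTHY)
print(f"true support:            {TRUE_SUPPORT:.2f}")
print(f"huge biased poll:        {huge_biased_poll:.2f}")
print(f"small random-sample poll: {small_random_poll:.2f}")
```

In this sketch the small random poll lands near the true value, while the huge poll is confidently wrong, because adding more respondents does nothing to correct who was asked.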
6. Examining the literature
Individual studies can be preliminary and flawed, or they can be rigorous and methodologically sound, but either way a single study is a single study. Few studies are so large, rigorous, and unambiguous in outcome that they can stand alone and can be considered definitive studies. Individual studies need to be put into the context of the overall research, or the published literature.
We must evaluate whether an individual study has been replicated by independent labs and researchers. If so, we must evaluate whether the results are consistent or mixed. In addition, did the studies look at the same thing and control for the same variables?
When looking at any research question, all of the literature must be considered and put into context.
7. Publication bias
Publication bias is the tendency for researchers to push their results toward publication when the results are interesting, positive, and somehow good for their career and reputation. Journal editors can also have a bias toward publishing positive studies, which can generate good press releases and draw attention to their journal.
8. Meta-analysis
Meta-analysis looks at many different studies addressing similar questions. It’s a study of studies, which combines the results of multiple studies into a new statistical analysis with greater statistical power.
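One simple way of combining studies is fixed-effect, inverse-variance pooling, in which each study is weighted by how precise it is. This is only one of several pooling methods and is chosen here as an assumption for illustration; the three study results are made up.

```python
def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Made-up results from three hypothetical studies of the same question.
estimates = [0.30, 0.10, 0.20]
standard_errors = [0.20, 0.10, 0.05]
pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

For these numbers the pooled estimate is about 0.186 with a standard error of about 0.044, smaller than the standard error of any single study. That gain in precision is the greater statistical power, but the pooling does nothing to remove bias already present in the individual studies.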
However, there are still new possibilities for bias. If the preliminary studies were poorly designed and biased, the meta-analysis will still reflect the bias of those preliminary studies. In fact, a meta-analysis is a poor predictor of the outcome of later large, definitive studies, agreeing with them only about 60-70% of the time (2).
9. Systematic reviews
Systematic reviews also look at all the evidence and consider the quality of each study. They look for patterns in the literature, consistency, replication, and relation to effect size and study quality.
Like meta-analysis, systematic reviews are also subject to bias. We need to look at which studies are included, the inclusion criteria, and the methods used to find studies. All of these can affect the outcome of the systematic review.
The take-home message is that all this design in scientific studies, and all these methods of evaluating the literature, are ways of compensating for our biases, flaws, and fallacies. Evidence should and must be used to figure out what is true, rather than to defend what we already wish to be true.

References
1. Squire, Peverill. “Why the 1936 Literary Digest Poll Failed.” Public Opinion Quarterly 52, no. 1 (1988): 125–133.
2. LeLorier, Jacques, Geneviève Grégoire, Abdeltif Benhaddad, Julie Lapierre, and François Derderian. “Discrepancies between Meta-Analyses and Subsequent Large Randomized, Controlled Trials.” New England Journal of Medicine 337 (1997): 536–542.
Novella, Steven. “Evidence in Medicine: Correlation and Causation.” Science-Based Medicine. http://www.sciencebasedmedicine.org/index.php/evidence-in-medicine-correlation-and-causation
Novella, Steven. “Evidence in Medicine: Experimental Studies.” Science-Based Medicine. http://www.sciencebasedmedicine.org/index.php/evidence-in-medicine-experimental-studies
Taper, Mark L., and Subhash R. Lele. The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations. Chicago: University of Chicago Press, 2004.