1.1 Did the authors use randomization? Randomization ensures balance, on average, between the two therapy groups with respect to both measurable and unmeasurable factors.
1.2 Did the authors use matching? [Discuss]
1.3 Did the authors use statistical adjustments? [Discuss]
Chapter 2: Was there a plan?
Introduction
The presence of a plan developed before data collection and analysis adds to the quality of a publication.
2.1 Did the research have a narrow focus?
2.2 Did the authors deviate from the plan?
Meat consumption and childhood cancer
Studies of the effects of diet on health often have difficulties with multiple endpoints. An example is a 1994 study of the effect of cured and broiled meat consumption on childhood cancer.
This study examined two types of cancer (acute lymphocytic leukemia and brain tumor). The authors examined five types of meat consumption (ham/bacon/sausage, hot dogs, hamburgers, lunch meats, and charcoal broiled foods). Finally, the authors looked at food consumption both of the child and of the mother during pregnancy.
In the analysis, the researchers used a cut-off to compare low meat consumption to high meat consumption. For example, they compared one or more hamburgers consumed per week to fewer than one per week. In the text, however, they went further and discussed results with a different cut-off: children who ate two or more hamburgers per week compared to children who ate one or fewer per week.
This study came under a lot of criticism for its scattershot approach to investigation, though it also had its share of defenders. There's a saying in statistics: "If you torture your data long enough, it will confess to something." When a research study has a plan with a limited number of precisely defined hypotheses, the results are more persuasive. When the research has no pre-planned hypotheses, the results should be considered preliminary and exploratory in nature.
2.1 Did the research have a narrow focus?
A good research study has limited objectives that are specified in advance. Failure to limit the scope of a study leads to problems with multiple testing.
When a large number of comparisons are being made, the study is considered a fishing expedition. There is a saying in statistics circles: "If you torture your data long enough, it will confess to something."
When is multiple testing likely to occur?
Multiple testing often occurs when a researcher examines a large number of subgroups or a large number of endpoints (Howel 1994). Multiple testing problems also occur when a study examines multiple side effects.
When multiple tests are done simultaneously within a paper, there is an increase in the overall Type I error. If 100 tests were performed at alpha = 0.05, you would expect about 5 of those tests to be significant, even if there was nothing at all going on. There are statistical adjustments for multiple comparisons, but these are controversial. Significant results from a large number of unplanned comparisons are useful mostly for setting future research priorities.
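The arithmetic behind this can be sketched in a few lines. The values of alpha and the number of tests come from the example above. The formula for the chance of at least one false positive assumes the tests are independent, which is an added assumption and rarely holds exactly in practice. The Bonferroni threshold is shown only as one well-known example of an adjustment, not as a recommendation.

```python
# Sketch only: alpha and n_tests come from the text. Independence of the
# tests is an ASSUMPTION needed for the "at least one" formula below.
alpha = 0.05
n_tests = 100

# Expected number of significant tests when no real effects exist.
expected_false_positives = alpha * n_tests

# Chance that at least one of the tests is significant by chance alone
# (valid only if the tests are independent).
family_wise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni adjustment: one example of a multiple-comparison correction.
bonferroni_threshold = alpha / n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one false positive: {family_wise_error:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.4f}")
```

Under these assumptions, at least one false positive is close to certain, which is why a handful of significant results out of many unplanned tests carries little weight.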
Optimal cut points and the problem with multiple comparisons.
Researchers will often simplify analysis of a continuous outcome measure by dividing that measure into two or more distinct groups on the basis of cut points. For example, a researcher might categorize subjects as having high or low blood pressure when they are above or below a certain value.
An abuse of this approach, called the minimum p-value approach, was noted by Altman (1994). Researchers would examine a variety of cut points and select the one that yielded the most favorable result, that is, the smallest p-value.
For example, some researchers have chosen the cut point from among a large number of possible cut points so as to make the difference in survival times between those patients above the cut point and those patients below the cut point as large as possible.
Examining multiple cut points inflates the chance of drawing a false conclusion (Type I error) from the traditional 5% to a value as large as 40%.
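A small simulation can illustrate why the minimum p-value approach inflates the Type I error. This is a sketch under stated assumptions and not a reproduction of Altman's analysis: the sample size, the number of simulated studies, the nine decile cut points, and the large-sample z-test are all choices made here for illustration. The marker and the outcome are generated independently, so every significant result is a false positive.

```python
import math
import random
import statistics

def two_sided_p(group_a, group_b):
    """Approximate two-sided p-value for a difference in means (large-sample z-test)."""
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    z = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_subjects=200, n_sims=500, alpha=0.05, seed=1):
    """Return (false positive rate with a median split,
               false positive rate with the minimum p-value approach).

    All parameter values are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    # Candidate cut points at the 10th, 20th, ..., 90th percentiles of the marker.
    cut_indices = [n_subjects * i // 10 for i in range(1, 10)]
    hits_median = 0
    hits_min_p = 0
    for _ in range(n_sims):
        # Marker and outcome are independent: there is no true effect.
        pairs = sorted((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_subjects))
        outcomes = [outcome for _, outcome in pairs]  # ordered by marker value
        p_values = [two_sided_p(outcomes[:k], outcomes[k:]) for k in cut_indices]
        if p_values[4] < alpha:    # cut point fixed in advance at the median
            hits_median += 1
        if min(p_values) < alpha:  # cut point chosen to give the smallest p-value
            hits_min_p += 1
    return hits_median / n_sims, hits_min_p / n_sims

median_rate, min_p_rate = simulate()
print(f"Pre-specified median split: {median_rate:.3f}")
print(f"Minimum p-value approach:   {min_p_rate:.3f}")
```

With only nine candidate cut points the inflation is smaller than the worst case quoted above, but the false positive rate should still be well above the nominal 5%, while the pre-specified median split stays close to it.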
There are several objective ways to select a cut point. Perhaps the best way is to select the cut point prior to looking at the data. This would involve the use of medical judgment.
After the data has been collected, there are some neutral ways of selecting a cut point. The simplest is a median split. If you wanted to create a median split for blood pressure, you would combine the blood pressure data from both groups, and select a value so that half of the blood pressures are larger and half are smaller.
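A median split as described above might look like the following sketch. The blood pressure values are made up for illustration and do not come from any study; the function name split_counts is likewise hypothetical.

```python
import statistics

# Hypothetical systolic blood pressures (mm Hg); not data from any study.
treatment = [118, 126, 131, 140, 152]
control = [112, 121, 135, 144, 149]

# Combine both groups before computing the median, so the cut point
# is chosen without reference to which group a subject belongs to.
cut_point = statistics.median(treatment + control)

def split_counts(values, cut):
    """Return (number at or below the cut, number above the cut)."""
    high = sum(1 for v in values if v > cut)
    return len(values) - high, high

print(f"Cut point: {cut_point}")
print(f"Treatment (low, high): {split_counts(treatment, cut_point)}")
print(f"Control   (low, high): {split_counts(control, cut_point)}")
```

Because the cut point depends only on the pooled values, it cannot be tuned to favor one group over the other.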
Subgroup analysis
Subgroup comparisons are a special case of multiple testing. Rather than looking at multiple endpoints, a subgroup analysis compares a single endpoint across several different subgroups within the data.
Subgroup comparisons suffer from three problems. First, the subgroup comparison is usually a non-randomized comparison. Second, the subgroup comparison has less precision because the sample size is smaller. Third, the number of subgroups that could be examined can easily swamp the sample size of the study.
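The second and third problems can be made concrete with a little arithmetic. In this sketch, the total sample size, the standard deviation, and the number of patient characteristics are hypothetical values chosen for illustration. It uses the standard error of a mean, sd / sqrt(n), as a simple measure of precision.

```python
import math

def standard_error(sd, n):
    """Standard error of a mean for a sample of size n."""
    return sd / math.sqrt(n)

total_n = 400  # hypothetical study size
sd = 10.0      # hypothetical standard deviation of the outcome

# Precision gets worse as the sample is divided into equal-sized subgroups.
for n_subgroups in (1, 4, 16):
    n_each = total_n // n_subgroups
    print(f"{n_subgroups:2d} subgroup(s) of {n_each:3d}: "
          f"standard error = {standard_error(sd, n_each):.2f}")

# Ten yes/no patient characteristics (hypothetical) define 2**10
# distinct combinations, more than the number of patients in the study.
n_binary_factors = 10
possible_subgroups = 2 ** n_binary_factors
print(f"Possible subgroups: {possible_subgroups} (patients: {total_n})")
```

Cutting the sample into quarters doubles the standard error, and a modest list of characteristics already produces more possible subgroups than there are patients.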
If you find a subgroup that behaves differently, then you need to ask yourself a few questions. Is this a subgroup that I would have studied a priori if I had been more careful during the planning stage? Is there a plausible mechanism to explain why this subgroup behaves differently? Are there other studies that have similar findings for this subgroup?
There are some technical issues with subgroup comparisons. You wouldn't want to declare that a therapy is effective in only one subgroup if the p-value for that subgroup was 0.043 and the p-value for the others was 0.062. The analysis of subgroups should be done as a formal test of interaction.
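To see why the two p-values above do not show that the subgroups differ, consider a simple z-test for the difference between two subgroup effects. This is one basic form of an interaction test, sketched under assumptions that are not in the original example: both effects point in the same direction, both subgroups have the same standard error (set to 1.0 here), and the estimates are approximately normal.

```python
import math
from statistics import NormalDist

def z_from_two_sided_p(p):
    """Size of the z statistic that gives a two-sided p-value of p."""
    return NormalDist().inv_cdf(1 - p / 2)

def interaction_p(effect_a, se_a, effect_b, se_b):
    """Two-sided p-value for the difference between two independent effects."""
    z = (effect_a - effect_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

se = 1.0  # ASSUMED common standard error for both subgroups

# Effect estimates implied by the two subgroup p-values, assuming both
# effects are in the same direction.
effect_a = z_from_two_sided_p(0.043) * se
effect_b = z_from_two_sided_p(0.062) * se

p_interaction = interaction_p(effect_a, se, effect_b, se)
print(f"Subgroup A effect: {effect_a:.2f}, subgroup B effect: {effect_b:.2f}")
print(f"p-value for the difference between subgroups: {p_interaction:.2f}")
```

Under these assumptions the two estimated effects are nearly identical, and the test of their difference is nowhere near significant, even though one subgroup falls just below 0.05 and the other just above it.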
A recent publication in the International Journal of Epidemiology provides empirical evidence that post hoc analyses are more likely to lead to false positive findings.
False positive outcomes and design characteristics in occupational cancer epidemiology studies. Gerard GMH Swaen, Olga Teggeler and Ludovic GPM van Amelsvoort. International Journal of Epidemiology 2001;30:948-954. http://ije.oupjournals.org/cgi/content/abstract/30/5/948
2.2 Did the authors deviate from the plan?
Not all research is predictable, so deviations from a pre-designed plan are sometimes necessary. Nevertheless, be cautious about any major deviation from the original research protocol. Some examples of deviations from the plan include:
Investigating end-points other than those originally specified.
Developing new exclusion criteria after the study has started.
You need to ask yourself if the authors deviated from the protocol in a conscious or subconscious effort to manipulate the results. Did the authors add other end-points in order to salvage a largely negative study? Were new exclusion criteria targeted to keep "troublesome" subjects out? It is impossible, of course, to discern the motives of the researchers. Nevertheless, for any deviation or modification to the protocol, you can ask whether this change would have made sense to include in the protocol if it had been thought of before data collection began.
An example of a deviation from the research plan.
An interesting deviation from the research plan occurred in a randomized, double-blind, controlled trial of selenium supplements (Clark 1996). The study was initiated in 1983 with basal cell carcinoma and squamous cell carcinoma of the skin as the primary end points. The researchers also looked for signs of selenium toxicity.
In 1990, funding was obtained to look at additional secondary end points (total mortality, cancer mortality, and incidence of lung, colorectal, and prostate cancers). While it was relatively easy to add extra endpoints in the middle of the study, the authors acknowledged that this represented a deviation from the protocol.
Another deviation from the protocol occurred when the study was terminated early (January 1996). No statistically significant changes were found in the primary endpoints, nor was any evidence of selenium toxicity found.
Among the secondary endpoints, however, the authors found statistically significant declines in total cancer mortality and lung cancer mortality. The authors also found statistically significant declines in the incidence of prostate cancer, colorectal cancer, lung cancer and total carcinomas. There was also a decline in overall mortality, though it did not achieve statistical significance.
There were no significant changes in the incidence of nine other types of cancer, including breast cancer, bladder cancer, and leukemia.
Because the significant results occurred in areas that were not originally planned for study, the authors acknowledge that any results have to be considered preliminary. Furthermore, it is unclear what


