The results of the Freiman study were very disappointing.
Of the 71 papers, 57 had greater than a 50% chance of missing a moderate improvement and 31 had a 50% or greater chance of missing a large improvement.
One wonders why anyone would undertake a study when there is such a high probability of failure. You should never initiate a study unless you know that the chance of missing a reasonable improvement is less than 20% (that is, unless the study has at least 80% power).
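The chance of missing an improvement can be estimated before a study begins. The following is a minimal sketch, assuming a two-sided comparison of two proportions with a normal approximation. The function name `chance_of_missing` and the event rates (30% versus 20%) are hypothetical illustrations and are not taken from the Freiman study.

```python
from math import sqrt
from statistics import NormalDist


def chance_of_missing(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate chance of missing a true difference (Type II error rate).

    Assumes a two-sided z-test comparing two proportions with equal group
    sizes, and ignores the negligible chance of rejecting in the wrong
    direction. This is a planning sketch, not an exact calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    diff = abs(p_treatment - p_control)
    # Standard error of the difference between the two observed proportions.
    se = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treatment * (1 - p_treatment) / n_per_group
    )
    power = z.cdf(diff / se - z_crit)
    return 1 - power


# Hypothetical event rates: 30% under control, 20% under treatment.
for n in (50, 100, 200, 400):
    miss = chance_of_missing(0.30, 0.20, n)
    print(f"n per group = {n:3d}: chance of missing = {miss:.2f}")
```

Under these assumed rates, a study with 50 patients per group is more likely than not to miss the improvement, while several hundred patients per group are needed to bring the chance of missing it below 20%.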
Special issues in a study of equivalency.
Some studies attempt to show not that a new therapy is superior to the standard therapy, but that it is equivalent. Showing equivalence requires a very careful assessment of sample size.
An example of an equivalence study is when a drug company tests a generic drug and wishes to show equivalence with the (presumably more expensive) brand name drug.
If we applied the traditional testing approach, the company would have a strong disincentive to design the study with an adequate sample size. A small study has little power to detect a difference between the two drugs, and under the traditional testing framework a failure to detect a difference would be read as evidence of equivalency.
There are several modifications to the traditional testing framework for equivalency studies. The simplest approach uses a confidence interval for the ratio of the outcome under the new therapy to the outcome under the standard therapy. If both limits of the confidence interval are reasonably close to 1 (e.g., no less than 0.8 and no more than 1.25), then the two therapies are considered equivalent.
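The confidence interval rule can be sketched in a few lines. This is an illustration only: it assumes the interval is built on the log scale from an estimated ratio and the standard error of its logarithm, and the 95% level, the function names, and the numbers are all hypothetical. The text above specifies only the 0.8 and 1.25 limits.

```python
from math import exp, log
from statistics import NormalDist


def ratio_confidence_interval(ratio, se_log_ratio, level=0.95):
    """Confidence interval for a ratio, computed on the log scale.

    The confidence level is an assumption; the text does not specify one.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(ratio) - z * se_log_ratio)
    upper = exp(log(ratio) + z * se_log_ratio)
    return lower, upper


def is_equivalent(lower, upper, low=0.8, high=1.25):
    """Both confidence limits must fall inside the equivalence range."""
    return low <= lower and upper <= high


# Hypothetical large study: precise estimate, narrow interval.
large = ratio_confidence_interval(1.0, se_log_ratio=0.05)
# Hypothetical small study: same estimate, much wider interval.
small = ratio_confidence_interval(1.0, se_log_ratio=0.30)

print("large study:", large, is_equivalent(*large))
print("small study:", small, is_equivalent(*small))
```

In this sketch the small study fails to show equivalence even though its estimated ratio is exactly 1, because its interval is too wide. Under this rule a small sample size works against the company rather than for it.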
Summary - How much did things change?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
5.1 Was there a quantitative measure of the size of the effect? Look for a confidence interval and compare the size of the effect to what you would expect to see in your practice.
5.2 Could other factors account for this effect? Look for differences in demographics between the two groups and ask if these differences could explain the results of the research.
5.3 Were any important outcomes forgotten? Research results should focus on endpoints that are of interest to your patients.
Additional resources
Rating health information on the Internet: navigating to knowledge or to Babel? Jadad, A. R. and A. Gagliardi (1998). JAMA 279(8): 611-4.
CONTEXT: The rapid growth of the Internet has triggered an information revolution of unprecedented magnitude. Despite its obvious benefits, the increase in the availability of information could also result in many potentially harmful effects on both consumers and health professionals who do not use it appropriately. OBJECTIVES: To identify instruments used to rate Web sites providing health information on the Internet, rate criteria used by them, establish the degree of validation of the instruments, and provide future directions for research in this area. DATA SOURCES: MEDLINE (1966-1997), CINAHL (1982-1997), HEALTH (1975-1997), Information Science Abstracts (1966 to September 1995), Library and Information Science Abstracts (1969-1995), and Library Literature (1984-1996); the search engines Lycos, Excite, Open Text, Yahoo, HotBot, Infoseek, and Magellan; Internet discussion lists; meeting proceedings; multiple Web pages; and reference lists. INSTRUMENT SELECTION: Instruments used at least once to rate the quality of Web sites providing health information with their rating criteria available on the Internet. DATA EXTRACTION: The name of the developing organization, Internet address, rating criteria, information on the development of the instrument, number and background of people generating the assessments, and data on the validity and reliability of the measurements. DATA SYNTHESIS: A total of 47 rating instruments were identified. Fourteen provided a description of the criteria used to produce the ratings, and 5 of these provided instructions for their use. None of the instruments identified provided information on the interobserver reliability and construct validity of the measurements. CONCLUSIONS: Many incompletely developed instruments to evaluate health information exist on the Internet. It is unclear, however, whether they should exist in the first place, whether they measure what they claim to measure, or whether they lead to more good than harm.
Chapter 6: Special guidelines for overviews and meta-analyses
Introduction
Meta-analysis is the quantitative pooling of data from two or more studies. When you are examining the results of a meta-analysis, you should ask the following questions:
6.1 Were apples combined with oranges? Heterogeneity among studies may make any pooled estimate meaningless.
6.2 Were all of the apples rotten? The quality of a meta-analysis cannot be any better than the quality of the studies it is summarizing.
6.3 Were some apples left on the tree? An incomplete search of the literature can bias the findings of a meta-analysis.
6.4 Did the pile of apples amount to more than just a hill of beans? Make sure that the meta-analysis quantifies the size of the effect in units that you can understand.
Declining sperm counts
In 1992, the British Medical Journal published a controversial meta-analysis. This study (Carlsen et al 1992) reviewed 61 papers published between 1938 and 1991 and showed that there was a significant decrease in sperm count and in seminal volume over this period of time. For example, a linear regression model on the pooled data provided an estimated average count of 113 million per ml in 1940 and 66 million per ml in 1990.
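The two fitted values quoted above are enough to recover the slope of the regression line. The short calculation below is only arithmetic on those two published estimates; it does not reproduce the original regression, and the variable names are illustrative.

```python
# Fitted average sperm counts (million per ml) quoted in the text.
count_1940 = 113.0
count_1990 = 66.0

# Slope of the straight line through the two fitted values.
slope = (count_1990 - count_1940) / (1990 - 1940)

# Decline over the 50 years as a fraction of the 1940 value.
relative_decline = (count_1940 - count_1990) / count_1940

print(f"slope: {slope:.2f} million per ml per year")
print(f"relative decline from 1940 to 1990: {relative_decline:.1%}")
```

The line falls by a little under 1 million per ml each year, which amounts to a decline of roughly 40% over the half century.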
Several researchers (Fisch and Goluboff 1996, Olsen et al 1995) noted heterogeneity in this meta-analysis, a mixing of apples and oranges. Studies before 1970 were dominated by studies in the United States and particularly studies in New York. Studies after 1970 included many other locations including third world countries. Thus the early studies were United States apples. The later studies were international oranges. There was also substantial variation in collection methods, especially in the extent to which the subjects adhered to a minimum abstinence period.
The original meta-analysis and the criticisms of it highlight both the greatest weakness and the greatest strength of meta-analysis.
Meta-analysis is the quantitative pooling of data from studies with sometimes small and sometimes large disparities. Think of it as a multi-center trial where each center gets to use its own protocol and where some of the centers are left out.
On the other hand, a meta-analysis lays all the cards on the table. Sitting out in the open are all the methods for selecting studies, abstracting information, and combining the findings. Meta-analysis allows objective criticism of these overt methods and even allows replication of the research.
Contrast this to an invited editorial or commentary that provides a subjective summary of a research area. Even when the subjective summary is done well, you cannot effectively replicate the findings. Since a subjective review is a black box, the only way, it seems, to repudiate a subjective summary is to attack the messenger.
Meta-analysis is used in a variety of different areas. Vine et al 1994 used meta-analysis to study the relationship between smoking and sperm concentration. Oehninger et al 2000 assessed the utility of sperm function assays in predicting successful outcomes in IVF. Goldberg et al 1999 compared intrauterine and intracervical insemination with frozen donor sperm. Evers et al 2001 reviewed the effectiveness of varicocelectomy in subfertile men.
6.1 Were apples combined with oranges?
Meta-analyses should not have inclusion criteria that are too broad. Including studies that are too dissimilar can lead to problems with "apples-to-oranges" comparisons. Example: When studying the effect of cholesterol lowering drugs, it makes no sense to combine a study of patients with recent heart attacks with another study of patients with high cholesterol but no previous heart attacks.
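One common way to check numerically whether studies are too dissimilar to pool is to compute Cochran's Q and the I-squared statistic alongside the pooled estimate. These statistics are not described in the text above; they are introduced here as a standard illustration. The sketch assumes a fixed-effect, inverse-variance pooled estimate, and the effect sizes and standard errors are made up.

```python
def pooled_estimate_and_heterogeneity(effects, std_errors):
    """Fixed-effect pooled estimate with Cochran's Q and I-squared.

    effects: one effect estimate per study (for example, a log odds ratio).
    std_errors: the standard error of each estimate.
    """
    # Inverse-variance weights: more precise studies count for more.
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of the variation beyond what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical "apples": three studies that broadly agree.
print(pooled_estimate_and_heterogeneity([0.48, 0.52, 0.50], [0.10, 0.10, 0.10]))
# Hypothetical "apples and oranges": two studies that disagree sharply.
print(pooled_estimate_and_heterogeneity([0.10, 0.90], [0.10, 0.10]))
```

When I-squared is close to 1, most of the spread among the study results reflects real differences between studies rather than sampling error, and the pooled estimate is hard to interpret.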
There is a lot of variability in how research is conducted. Even in carefully controlled randomized controlled trials, researchers have tremendous discretion. Sometimes this discretion creates heterogeneity among studies, making it difficult to combine the studies.
Heterogeneity in the composition of the treatment and control groups.
Researchers can differ in the inclusion and exclusion criteria.
Even if these criteria do not differ, there may still be differences in the baseline levels of health in the patients, due to geographical differences in the patient population.
The controls could be selected independently, or they could be matched to the treatment group subjects.
The control subjects could be given no treatment, a placebo, or a standard treatment.


