Exclusion of subjects can make the study biased or less generalizable.
4.1 Who was excluded at the start of the study? Excessively strict entry criteria in a research study can make it difficult to extrapolate to the types of patients that you normally see.
4.2 Who dropped out during the study? A large number of drop-outs during the course of a research study can bias the final conclusions.
Chapter 5: How much did things change?
Introduction
It's not enough just to assess statistical significance in a study. You also need to make sure that the difference has a practical impact, that it represents a clinically relevant outcome, and that there was a sufficient number of patients to provide reasonable precision.
When you are looking at how much things changed, ask yourself the following questions:
5.1 Did the authors measure the right thing?
5.2 Was the change clinically significant?
5.3 Were there enough subjects?
Non-steroidal anti-inflammatory drugs
A 1987 study of non-steroidal anti-inflammatory drugs (NSAIDs) showed that patients who took these drugs were 50% more likely to develop upper gastrointestinal (UGI) bleeding. This difference was statistically significant at alpha=.05. UGI bleeding, however, was rare in both groups: only 1 case per thousand person-years in the controls and 1.5 per thousand person-years in the NSAID group. If you see 100 patients a year, you would have to wait two decades, more or less, in order to see one excess event of bleeding, on average.
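The arithmetic behind the two-decade estimate can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes that each patient you see contributes one person-year of NSAID exposure per year, which is a simplification rather than something the study reported.

```python
def years_per_excess_event(control_rate, exposed_rate, patients_per_year):
    """Average years of practice before one excess event is expected.

    Rates are events per person-year. Assumes (hypothetically) that every
    patient seen contributes one person-year of exposure per year.
    """
    excess_rate = exposed_rate - control_rate  # absolute increase in risk
    excess_events_per_year = excess_rate * patients_per_year
    return 1 / excess_events_per_year


control_rate = 1.0 / 1000  # 1 case per thousand person-years
nsaid_rate = 1.5 / 1000    # 1.5 cases per thousand person-years

relative_risk = nsaid_rate / control_rate
wait = years_per_excess_event(control_rate, nsaid_rate, patients_per_year=100)

print(f"Relative risk: {relative_risk:.2f}")
print(f"Years of practice per excess bleed: {wait:.0f}")
```

The relative risk (1.5) sounds large, but the absolute increase is only half a case per thousand person-years, which is why the wait works out to about twenty years.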
In this article, the authors were up front about the very small increase in risk. Most authors, however, are so relieved to achieve statistical significance that they forget to consider whether the size of the difference will improve clinical practice.
This is summarized well in the following Gertrude Stein quote: "For a difference to be a difference it has to make a difference."
5.1 Did the authors measure the right thing?
There is a tendency to focus on intermediate measures that are easy to assess, but which may or may not be predictive of more important endpoints. Improvement in forced expiratory volume may not translate into a reduction in asthma attacks. A reduction in abnormal ventricular depolarization may not translate into a reduction in the recurrence of heart attacks. If an intermediate endpoint is used, ask yourself whether there is an adequate link between this endpoint and something that is relevant to your patients.
Be careful that you don't focus solely on the outcomes mentioned in the abstract. Abstracts tend to report only the outcome measures that were statistically significant, rather than the outcome measures of most interest to health care professionals.
Also, always consider whether the researchers adequately examined side effects.
Measurement error
Measurement error is simply the inability to measure an important variable accurately. Measurement error in the outcome variable does not ordinarily cause bias, but measurement error in factors that can predict the outcome is of serious concern.
There are several ways to assess dietary fat intake. The most accurate (and also the most costly) way is through the use of prospectively recorded food diaries.
Sometimes the cost limitations or the retrospective nature of a research study will require a less accurate assessment of dietary fat, such as through an interview. Shapiro (1997) points out that estimation of dietary fat using interviews tends to correlate poorly with estimation using prospective diaries. This would cast doubt, for example, on retrospective studies that tried to associate dietary fat intake with the risk of breast cancer.
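The contrast between error in an outcome and error in a predictor can be illustrated with a small simulation. This is only a sketch under invented assumptions: a straight-line relationship with a true slope of 2, normally distributed values, and measurement error as large as the spread of the predictor itself. None of these numbers comes from Shapiro (1997) or any real dietary study.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(1)  # fixed seed so the sketch is repeatable
n = 20_000

# Hypothetical "true" predictor and outcome; the true slope is 2.
true_x = [random.gauss(0, 1) for _ in range(n)]
true_y = [2 * x + random.gauss(0, 1) for x in true_x]

# Add measurement error to the outcome only, then to the predictor only.
noisy_y = [y + random.gauss(0, 1) for y in true_y]
noisy_x = [x + random.gauss(0, 1) for x in true_x]

slope_noisy_outcome = slope(true_x, noisy_y)
slope_noisy_predictor = slope(noisy_x, true_y)

print(f"Slope with error in the outcome:   {slope_noisy_outcome:.2f}")
print(f"Slope with error in the predictor: {slope_noisy_predictor:.2f}")
```

Under these assumptions, error in the outcome leaves the estimated slope close to 2, while error in the predictor pulls it toward roughly half that value. An inaccurate measure of dietary fat could, in the same way, make a real association look weaker than it is.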
Unvalidated measures
[Discuss]
Short term measures
[Discuss]
Retrospective data
Retrospective data are data collected by looking backwards in time. We obtain these data by asking subjects to recall events that occurred earlier in their lives. We also get retrospective data when we review medical records, birth certificates, death certificates, or other sources of historical data. In contrast, data collected during the course of the study are known as prospective data.
Retrospective data are often inexpensive to collect, but you should be concerned about their accuracy. The ability of a subject to recall information is sometimes affected by which group they are in.
Women who have experienced miscarriages, for example, are more likely to search for and remember events that they feel might "explain" their miscarriage, much more so than a group of comparable control subjects. This differential level of reporting is known as recall bias.
In addition, historical data are often incomplete and it is sometimes difficult to verify their accuracy. Therefore, retrospective data are considered less authoritative than prospective data.
An example of recall bias.
An interesting review of the research process that helped establish that smoking causes lung cancer can be found in Gail (1996). One aspect of the research process was addressing the issue of recall bias.
Doll (1950) studied the association between tobacco smoking and cancer. They selected 709 patients with lung cancer and an equal number of matched controls. The authors were concerned about the retrospective assessment of smoking among patients in both groups. Would patients with lung cancer exaggerate the amount of smoking? Would the interviewers press harder for information about smoking among the cancer patients?
While it would be impossible to totally rule out recall bias, the authors did examine a third group: patients who were diagnosed with lung cancer and who later found out that they suffered from a different disease (false cases). If recall bias were the sole explanation of the difference in reported smoking, then the group of false cases should have reported a level of smoking similar to that of the lung cancer patients. Instead they reported a lower level of smoking. This helped to rule out the possibility that recall bias alone accounted for the higher reported smoking levels in the lung cancer patients.
5.2 Was the change clinically significant?
Research results should be quantifiable. Look for measurements of important outcomes that are free from bias.
"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." William Thomson Kelvin (Lord Kelvin)
Knowing that a new therapy is better is not enough information. You need to quantify how much better the new therapy is. In this respect, confidence intervals are better than p-values. A p-value tells you whether the new therapy is better. A confidence interval tells you whether the new therapy is better and by how much. A confidence interval allows you to balance the size of the improvement against the possibility of greater cost or more side effects. Many journals now require confidence intervals instead of p-values.
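As a rough illustration of what a confidence interval adds, the sketch below computes a simple (Wald) 95% confidence interval for the difference between two proportions. The trial numbers and the function name are hypothetical, and the Wald formula is only one of several ways to build such an interval.

```python
import math


def risk_difference_ci(events_new, n_new, events_std, n_std, z=1.96):
    """Wald confidence interval for a difference in two proportions.

    Returns (difference, lower limit, upper limit); z=1.96 gives about 95%.
    """
    p_new = events_new / n_new
    p_std = events_std / n_std
    diff = p_new - p_std
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    return diff, diff - z * se, diff + z * se


# Hypothetical trial: 60 of 100 patients improve on the new therapy,
# 45 of 100 improve on the standard therapy.
diff, lower, upper = risk_difference_ci(60, 100, 45, 100)
print(f"Improvement: {diff:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

In this made-up example the interval excludes zero, so the result would be called statistically significant, but it runs from roughly 1 to 29 percentage points. The p-value alone would hide the fact that the true benefit could be anywhere from trivial to substantial.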
Statistical methods are sometimes able to detect differences that are so small as to be meaningless from any practical perspective. This is known as statistical significance without clinical significance. Always put the numbers into the perspective of your practice. Try to estimate how many of the patients you see within a year are likely to do better under the new therapy.
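A short sketch shows how a trivially small difference becomes statistically significant once the sample is large enough. The numbers are invented: a 0.5-point difference on some outcome scale with a standard deviation of 10 points, compared with a two-sided z-test that assumes equal group sizes and a known standard deviation.

```python
import math


def two_sample_z(mean_diff, sd, n_per_group):
    """Two-sided z-test for a difference in means.

    Assumes equal group sizes and a common, known standard deviation.
    Returns (z statistic, p-value).
    """
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same hypothetical 0.5-point difference, two very different sample sizes.
z_small, p_small = two_sample_z(0.5, 10, 100)
z_large, p_large = two_sample_z(0.5, 10, 100_000)

print(f"n = 100 per group:     z = {z_small:.2f}, p = {p_small:.3f}")
print(f"n = 100,000 per group: z = {z_large:.2f}, p = {p_large:.3g}")
```

The difference is the same half point in both cases. Only the sample size changed, yet the p-value goes from far above .05 to far below it. Whether half a point matters to your patients is a clinical question that the p-value cannot answer.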
5.3 Were there enough subjects?
Every research study, especially negative studies, should justify the sample size chosen. It is unethical to perform research on humans or animals without first demonstrating that the sample size you have chosen is appropriate.
Justification of sample size is particularly important for a negative study (one where no difference between the standard and new therapies was found) and in studies assessing the equivalence of two therapies.
How can you tell if the sample size is too small?
Ideally, the authors should provide justification of the sample size in the paper itself. The justification is considered better if it is made a priori (prior to the start of the data collection). If no justification of sample size (e.g., power calculations) is given, examine the width of the confidence intervals. Very wide intervals indicate an inadequate sample size.
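To get a feel for how the width of a confidence interval depends on sample size, consider this sketch. It uses the simple Wald interval for a single proportion and an assumed true rate of 30%; both choices, like the function name, are for illustration only.

```python
import math


def ci_width_for_proportion(p, n, z=1.96):
    """Full width of a Wald confidence interval for a proportion (z=1.96 for ~95%)."""
    return 2 * z * math.sqrt(p * (1 - p) / n)


# Hypothetical outcome rate of 30% at several sample sizes.
for n in (20, 100, 500, 2000):
    width = ci_width_for_proportion(0.3, n)
    print(f"n = {n:>4}: interval width = {width:.1%}")
```

With 20 patients the interval spans about 40 percentage points, which is too wide to tell a useless therapy from a very effective one. Because the width shrinks with the square root of the sample size, you need four times as many patients to cut it in half.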
There are many examples of studies with inadequate sample sizes.
A revealing study of inadequate sample size appears in Freiman 1992. In a series of 71 publications appearing between 1960 and 1977, the outcome was either percent mortality, percent complications, or a similar outcome that could be measured as a percentage. The authors examined power, the ability of the study to detect either a moderate i


