Thursday, May 3, 2012

The limitations of – and explosion in the number of – observational studies

Health News Review writes:

The limitations of – and explosion in the number of – observational studies:
In the Wall Street Journal, Gautam Naik has a thoughtful piece, “Analytical Trend Troubles Scientists,” hitting on the limitations of – and the explosion in the number of – observational studies.  Excerpts:
“While the gold standard of medical research is the randomly controlled experimental study, scientists have recently rushed to pursue observational studies, which are much easier, cheaper and quicker to do. Costs for a typical controlled trial can stretch high into the millions; observational studies can be performed for tens of thousands of dollars.
In an observational study there is no human intervention. Researchers simply observe what is happening during the course of events, or they analyze previously gathered data and draw conclusions. In an experimental study, such as a drug trial, investigators prompt some sort of change—by giving a drug to half the participants, say—and then make inferences.
But observational studies, researchers say, are especially prone to methodological and statistical biases that can render the results unreliable. Their findings are much less replicable than those drawn from controlled research. Worse, few of the flawed findings are spotted—or corrected—in the published literature.
“You can troll the data, slicing and dicing it any way you want,” says S. Stanley Young of the U.S. National Institute of Statistical Sciences. Consequently, “a great deal of irresponsible reporting of results is going on.”
Despite such concerns among researchers, observational studies have never been more popular.
Nearly 80,000 observational studies were published in the period 1990-2000 across all scientific fields, according to an analysis performed for The Wall Street Journal by Thomson Reuters. In the following period, 2001-2011, the number of studies more than tripled to 263,557, based on a search of Thomson Reuters Web of Science, an index of 11,600 peer-reviewed journals world-wide. The analysis likely doesn’t capture every observational study in the literature, but it does indicate a pattern of growth over time.
A vast array of claims made in medicine, public health and nutrition are based on observational studies, as are those about the environment, climate change and psychology.”
The article addresses the “hot area of medical research” – the search for biomarkers.
“The presence or absence of the biomarkers in a patient’s blood, some theorized, could indicate a higher or lower risk for heart disease—the biggest killer in the Western world.
Yet these biomarkers “are either completely worthless or there are only very small effects” in predicting heart disease, says John Ioannidis of Stanford University, who extensively analyzed two decades’ worth of biomarker research and published his findings in Circulation Research journal in March. Many of the studies, he found, were undermined by statistical biases, and many of the biomarkers showed very little predictive ability of heart disease.
His conclusion is widely upheld by other scientists: Just because two events are statistically associated in a study, it doesn’t mean that one necessarily sets off the other. What is merely suggestive can be mistaken as causal.
That partly explains why observational studies in general can be replicated only 20% of the time, versus 80% for large, well-designed randomly controlled trials, says Dr. Ioannidis. Dr. Young, meanwhile, pegs the replication rate for observational data at an even lower 5% to 10%.
Whatever the figure, it suggests that a lot more of these studies are getting published. Those papers can often trigger pointless follow-on research and affect real-world practices.”
But the story also appropriately points out the contributions observational studies have made:
“Observational studies do have many valuable uses. They can offer early clues about what might be triggering a disease or health outcome. For example, it was data from observational trials that flagged the increased risk of heart attacks posed by the arthritis drug Vioxx. And it was observational data that helped researchers establish the link between smoking and lung cancer.”
I have written many times about the weakness of news stories that fail to point out the limitations of observational studies and – more specifically – stories that use causal language to describe the findings from observational studies that can “only” point to statistical associations.
News consumers and health care consumers need to better understand the limitations of all studies – including randomized clinical trials.

Wednesday, April 18, 2012

Testing for balance: The power argument against t-tests is true, but not important and probably irrelevant?

Several authors have argued that it is wrong to use a standard t-test to determine whether there is a significant difference between the control group and the treatment group after matching. Imai has several papers about this, and Austin has collected information about how often the incorrect approach is used (here).

The core of the problem is simple: by deleting many observations from the sample, one could easily make the matched groups appear similar, since doing so reduces the power of the test. It is harder to find statistically significant differences in small groups because the standard error is bound to be large.

Although the argument is clearly true, I would argue that it is wrong to put a lot of emphasis on it. It is unlikely to matter much in large samples. Reducing the number of observations in a small sample greatly reduces the power, but in a large sample it will not make much of a difference: the standard error of the mean difference shrinks with the square root of the number of observations, so the problem fades quickly as the sample grows.
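
To make the point concrete, here is a small simulation sketch (in Python with numpy and scipy; the 0.1 standard deviation imbalance and the sample sizes are just illustrative assumptions). The same real imbalance between the matched groups is essentially invisible to a t-test with 50 units per group, but easily detected with several thousand, because the standard error shrinks with the square root of the sample size.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_gap = 0.1   # a fixed, real imbalance of 0.1 standard deviations between the groups

for n in (50, 500, 5000, 50000):
    treated = rng.normal(true_gap, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    _, p_value = ttest_ind(treated, control)
    approx_se = np.sqrt(2.0 / n)   # standard error of the mean difference shrinks with sqrt(n)
    print(f"n per group = {n:6d}   approx. SE = {approx_se:.3f}   t-test p-value = {p_value:.3f}")
```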

It is still correct, however, to be careful with standard t-tests in this context. A difference between the two groups in, for instance, gender balance could be important in the analysis even if it is not statistically significant.

More generally, one would like to have a criterion for determining what is meant by "the best possible balance." Austin, in his article, follows Ho and suggests that we should minimize the standardized difference: simply the difference between the average in the control group and the average in the treated group, divided by an estimate of the pooled standard deviation.
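
For reference, a minimal sketch of this metric in Python (the function and the toy data are my own, purely for illustration):

```python
import numpy as np

def standardized_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(treated, ddof=1) + np.var(control, ddof=1)) / 2.0)
    return (np.mean(treated) - np.mean(control)) / pooled_sd

# Toy example: age in a matched treated and control group (invented numbers).
rng = np.random.default_rng(0)
treated_age = rng.normal(50, 10, 300)
control_age = rng.normal(52, 10, 300)
print(round(standardized_difference(treated_age, control_age), 3))   # roughly -0.2
```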

Not only would we like to have such a criterion, we would also like to be able to implement it easily in a computer program. Unfortunately, it turns out that overall balance is not simply a monotonic function of the weights on the variables. This has led some, like Sekhon, to try genetic algorithms (here), which seem to fit well with these kinds of problems.

Sekhon also argues that the choice of a balance metric should not be based only on descriptive metrics (like standardized mean differences). The argument is that such criteria can lead the algorithm to focus too much on one aspect of balance while ignoring other aspects that are also relevant for making correct causal claims. Focusing only on minimizing mean differences, for example, could lead to an increase in maximum differences, which also matter. In short, the whole distribution is important, and focusing on one aspect risks worsening balance in some other aspect.

Sekhon's solution is to focus on the ultimate aim: we are not seeking balance for its own sake, but in order to draw correct causal conclusions from the data. This requires, he argues, at least balance on the first two moments of the distribution (the expected value and the variance). This leads him to maximize balance based on the Kolmogorov-Smirnov test.
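
The contrast between the two criteria is easy to see in a toy example. In the invented data below, the treated and control groups have (nearly) the same mean but very different spreads, so the standardized mean difference is close to zero while the Kolmogorov-Smirnov statistic clearly flags the imbalance (a sketch using scipy's ks_2samp):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
treated = rng.normal(0.0, 1.0, 2000)   # same mean as the controls...
control = rng.normal(0.0, 2.0, 2000)   # ...but twice the standard deviation

# Standardized mean difference: close to zero, so it sees no problem.
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
print("Standardized mean difference:", round((treated.mean() - control.mean()) / pooled_sd, 3))

# Kolmogorov-Smirnov statistic: compares the whole distributions and flags the imbalance.
ks = ks_2samp(treated, control)
print("KS statistic:", round(ks.statistic, 3), " p-value:", ks.pvalue)
```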

There are still a lot of questions about how to measure and maximize balance between groups of units with many different characteristics. Although theoretically interesting, one should perhaps invest some energy in determining how important the problems are. Proving that a problem exists is far from proving that it creates serious problems in general. Sometimes it is even possible to prove that a problem is less important than one believed. For instance, Cochran proved that an analysis based on five sub-groups would eliminate roughly 90% of the bias. This surprising result may not end the discussion about how many groups to use, but it made the whole issue less important. The question of balance metrics is not analogous, but it would be interesting to do some simulations of the importance of using different balance metrics. I suspect they do not make a big difference in most situations, but this does not, of course, imply that it is uninteresting to find the small exceptions. Knowing when something might fail is a good thing, even if you are unlikely to encounter it.
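
In that spirit, here is one very small, invented simulation of the kind I have in mind. The outcome depends on the square of a covariate, so a control group that matches the treated group only on the mean of the covariate gives a badly biased effect estimate, while one that matches the whole distribution does fine; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_effect = 1.0   # invented treatment effect

# The outcome depends on the square of the covariate x, not just its mean.
x_treated = rng.normal(0.0, 1.0, n)
y_treated = x_treated**2 + true_effect + rng.normal(0.0, 0.1, n)

# "Match" A: same mean of x as the treated group, but twice the spread.
x_match_a = rng.normal(0.0, 2.0, n)
y_match_a = x_match_a**2 + rng.normal(0.0, 0.1, n)

# "Match" B: same full distribution of x as the treated group.
x_match_b = rng.normal(0.0, 1.0, n)
y_match_b = x_match_b**2 + rng.normal(0.0, 0.1, n)

print("Mean imbalance in x, match A:", round(x_treated.mean() - x_match_a.mean(), 3))   # ~0
print("Mean imbalance in x, match B:", round(x_treated.mean() - x_match_b.mean(), 3))   # ~0
print("Estimated effect with match A:", round(y_treated.mean() - y_match_a.mean(), 2))  # badly biased (about -2)
print("Estimated effect with match B:", round(y_treated.mean() - y_match_b.mean(), 2))  # close to the true 1
```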


Monday, April 16, 2012

Judea Pearl and causality

Judea Pearl gave a lecture in Oslo some time ago and I just want to digest it by writing this blog post.

His main argument was that causality is not a statistical concept. Statistical concepts are derived from the joint distribution. Causality, however, cannot be derived from a joint distribution (alone), since changing a variable will lead to a new joint distribution. If I double the price of a product, I cannot automatically use the old joint probability distribution to infer the effect of this change. It all depends on why there was a change, on other circumstances, and on the stability of the underlying functions.

This is clearly true. Some may object that not all changes lead to large changes in the joint distribution, that we may have prior knowledge that the relationship is often stable, and so on, but in principle it seems clear that we cannot get causality from observing the joint distribution alone. We need some kind of identifying assumption or mechanism, something that identifies the effect of changes in a (structural) model: a randomized experiment, an instrumental variable, something!
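
Pearl's price example can be mimicked in a few lines. In the invented simulation below, observed prices respond to a demand shock that also drives sales, so a regression on the observational data gets the sign of the price effect wrong, while the same regression on data with randomly assigned prices recovers it. The parameters are my own illustrative assumptions, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_price_effect = -2.0   # invented "structural" effect of price on sales

# Observational data: the seller raises the price when demand is high,
# so price is correlated with the demand shock that also drives sales.
demand = rng.normal(size=n)
price_obs = 10 + 1.5 * demand + rng.normal(scale=0.5, size=n)
sales_obs = 100 + true_price_effect * price_obs + 5.0 * demand + rng.normal(size=n)

# Experimental data: prices are set at random, independently of demand.
price_exp = 10 + rng.normal(size=n)
sales_exp = 100 + true_price_effect * price_exp + 5.0 * rng.normal(size=n) + rng.normal(size=n)

slope_obs = np.polyfit(price_obs, sales_obs, 1)[0]
slope_exp = np.polyfit(price_exp, sales_exp, 1)[0]
print("True effect:", true_price_effect)
print("Regression on observational data:", round(slope_obs, 2))   # wrong sign (about +1)
print("Regression on randomized prices: ", round(slope_exp, 2))   # close to -2
```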

Pearl's second argument was that the standard statistical and mathematical language is unable to articulate the assumptions needed to talk about causality. We need new notation, he argued, to distinguish between equations and claims about causality. In his language y = 2x is an equation, while y := 2x or y <-- 2x symbolizes a causal claim. More generally, he argues that there are two approaches and languages that can be used: the first is structural equation modelling (which he approaches using arrows, graphs and do-operators); the second is the potential-outcomes language used by Neyman, Rubin and many others. In that language the notation Y1 and Y0 indicates the value of the outcome when treated versus not treated.

So what, one might say? Is it so important to make the distinctions and notational investments above? Pearl has at least managed to show that the language is useful for generating surprising conclusions with important practical implications. For instance, his graphical approach and the "do-calculus" make it easier to identify when it is possible to estimate a causal effect and how it should be done. He has also shown, somewhat surprisingly, that conditioning on some types of variables ("colliders") in a standard regression analysis will introduce bias. Finally, using graphs and the do-calculus makes it easier to focus attention on how to achieve identification of causal effects (by closing "back doors"). This is all very valuable.
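
The collider result is easy to verify numerically. In the toy sketch below (my own example, not Pearl's), x and y are generated independently and z is a common effect of both; a regression of y on x alone correctly finds no relationship, while adding the collider z as a control produces a strong, purely spurious coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x and y are generated independently; z is a common effect of both (x -> z <- y).
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

def ols_coefficients(outcome, regressors):
    """OLS with an intercept; returns the estimated coefficients."""
    design = np.column_stack([np.ones(n)] + regressors)
    return np.linalg.lstsq(design, outcome, rcond=None)[0]

# Without the collider: the coefficient on x is essentially zero, as it should be.
print("y ~ x     :", ols_coefficients(y, [x])[1].round(3))
# Controlling for the collider z: a strong, purely spurious negative coefficient on x.
print("y ~ x + z :", ols_coefficients(y, [x, z])[1].round(3))
```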

The framework works, but it seems to lose at least some of its visual elegance when we introduce a time dimension with lags, feedbacks and dose-response relationships. Pearl's answer was that the approach is still valid, but that in a dynamic situation one would have to imagine a succession of graphs.

In sum, it was a stimulating talk. Some might argue that the approach is slightly too much "either/or." Pearl makes a very sharp distinction between when it is possible and when it is impossible to identify a causal effect; there is no such thing as "approximate identification." This is clearly true mathematically, but sometimes the important question is not "is it possible?" but "how likely is it that I have gotten closer to a decision-relevant conclusion?" To use an analogy: it is infeasible to solve large NP-hard problems exactly, but it is often possible to get an approximate solution fast. In the same way, I sometimes get the impression that Pearl's approach focuses heavily on "yes/no" questions as opposed to questions of degree ("how bad is it if x is the case compared to y when I try to identify a causal effect?"). To be slightly more specific: using a collider in a regression is bad, yes, but how bad is it? And what determines the degree to which it produces misleading results? But these, I guess, are practical problems that come after the logical and conceptual clarification.





Sims on Angrist and Pischke

Sims' comments (critical)

Also useful, partly as a contrast:

"Misunderstandings between experimentalists and observationalists about causal inference" by Imai, King and Stewart