Thursday, May 3, 2012

The limitations of – and explosion in the number of – observational studies

Health News Review writes:

In the Wall Street Journal, Gautam Naik has a thoughtful piece, “Analytical Trend Troubles Scientists,” hitting on the limitations of – and the explosion in the number of – observational studies.  Excerpts:
“While the gold standard of medical research is the randomly controlled experimental study, scientists have recently rushed to pursue observational studies, which are much easier, cheaper and quicker to do. Costs for a typical controlled trial can stretch high into the millions; observational studies can be performed for tens of thousands of dollars.
In an observational study there is no human intervention. Researchers simply observe what is happening during the course of events, or they analyze previously gathered data and draw conclusions. In an experimental study, such as a drug trial, investigators prompt some sort of change—by giving a drug to half the participants, say—and then make inferences.
But observational studies, researchers say, are especially prone to methodological and statistical biases that can render the results unreliable. Their findings are much less replicable than those drawn from controlled research. Worse, few of the flawed findings are spotted—or corrected—in the published literature.
“You can troll the data, slicing and dicing it any way you want,” says S. Stanley Young of the U.S. National Institute of Statistical Sciences. Consequently, “a great deal of irresponsible reporting of results is going on.”
Despite such concerns among researchers, observational studies have never been more popular.
Nearly 80,000 observational studies were published in the period 1990-2000 across all scientific fields, according to an analysis performed for The Wall Street Journal by Thomson Reuters. In the following period, 2001-2011, the number of studies more than tripled to 263,557, based on a search of Thomson Reuters Web of Science, an index of 11,600 peer-reviewed journals world-wide. The analysis likely doesn’t capture every observational study in the literature, but it does indicate a pattern of growth over time.
A vast array of claims made in medicine, public health and nutrition are based on observational studies, as are those about the environment, climate change and psychology.”
The article addresses the “hot area of medical research” – the search for biomarkers.
“The presence or absence of the biomarkers in a patient’s blood, some theorized, could indicate a higher or lower risk for heart disease—the biggest killer in the Western world.
Yet these biomarkers “are either completely worthless or there are only very small effects” in predicting heart disease, says John Ioannidis of Stanford University, who extensively analyzed two decades’ worth of biomarker research and published his findings in Circulation Research journal in March. Many of the studies, he found, were undermined by statistical biases, and many of the biomarkers showed very little predictive ability of heart disease.
His conclusion is widely upheld by other scientists: Just because two events are statistically associated in a study, it doesn’t mean that one necessarily sets off the other. What is merely suggestive can be mistaken as causal.
That partly explains why observational studies in general can be replicated only 20% of the time, versus 80% for large, well-designed randomly controlled trials, says Dr. Ioannidis. Dr. Young, meanwhile, pegs the replication rate for observational data at an even lower 5% to 10%.
Whatever the figure, it suggests that a lot more of these studies are getting published. Those papers can often trigger pointless follow-on research and affect real-world practices.”
But the story also appropriately points out the contributions observational studies have made:
“Observational studies do have many valuable uses. They can offer early clues about what might be triggering a disease or health outcome. For example, it was data from observational trials that flagged the increased risk of heart attacks posed by the arthritis drug Vioxx. And it was observational data that helped researchers establish the link between smoking and lung cancer.”
I have written many times about the weakness of news stories that fail to point out the limitations of observational studies and – more specifically – stories that use causal language to describe the findings from observational studies that can “only” point to statistical associations.
News consumers and health care consumers need to better understand the limitations of all studies – including randomized clinical trials.

Wednesday, April 18, 2012

Testing for balance: The power argument against t-tests is true, but not very important and probably often irrelevant?

Several authors have argued that it is wrong to use a standard t-test when determining whether there is a significant difference between the control group and the treatment group after matching. Imai has several papers on this, and Austin has collected information about how often the incorrect approach is used (here).

The core of the problem is simple: by deleting many observations from the sample, one can easily make the matched groups look similar, because dropping observations reduces the power of the test. It is harder to find statistically significant differences in small groups because the standard error is bound to be large.

Although this is clearly true, I would argue that it is wrong to put much emphasis on the power argument, because it is unlikely to matter much in large samples. Reducing the number of observations in a small sample greatly reduces the power, but in a large sample it makes little difference: the standard error shrinks with the square root of the sample size, so the problem fades quickly as the number of observations grows.
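To see the point concretely, here is a minimal base-R sketch of my own (not taken from the papers mentioned): the same small underlying imbalance between two groups "passes" a t-test in a small matched sample and fails it in a large one, simply because the standard error depends on the sample size. The sample sizes and the 0.1 standard-deviation shift are arbitrary choices for illustration.

```r
# Same true imbalance (0.1 SD), very different t-test verdicts depending on n.
set.seed(1)

balance_p <- function(n, shift = 0.1) {
  treated <- rnorm(n, mean = shift)  # treated group with a small true imbalance
  control <- rnorm(n, mean = 0)      # control group
  t.test(treated, control)$p.value
}

mean(replicate(500, balance_p(50)))    # small matched sample: p-values mostly large, looks "balanced"
mean(replicate(500, balance_p(5000)))  # large sample: p-values tiny, the same imbalance is flagged
```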

It is still correct, however, to be careful with standard t-tests in this context: a difference between the two groups in, for instance, gender balance could be important for the analysis even if it is not statistically significant.

More generally, one would like a criterion for what is meant by "the best possible balance." Austin, following Ho, suggests that we should minimize the standardized difference. This is simply the difference between the averages in the treated and control groups divided by an estimate of the pooled standard deviation.
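As a rough sketch, the standardized difference is easy to compute. The pooling formula below (averaging the two group variances) is one common choice and not necessarily the exact variant Austin uses.

```r
# Standardized difference: mean difference divided by a pooled standard deviation.
std_diff <- function(x_treated, x_control) {
  pooled_sd <- sqrt((var(x_treated) + var(x_control)) / 2)
  (mean(x_treated) - mean(x_control)) / pooled_sd
}

std_diff(rnorm(100, mean = 0.2), rnorm(100))  # example call on simulated covariates
```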

Not only would we like to have such a criterion, we would also like to be able to implement it easily in a computer program. Unfortunately, it turns out that the overall balance is not simply a monotonic function of the weights on the variables. This has led some, like Sekhon, to try genetic algorithms (here), which seem to fit well with these kinds of problems.

Sekhon also argues that the choice of a balance metric should not be based only on descriptive metrics (like standardized mean differences). The argument is that such criteria can lead the algorithm to focus too much on one aspect of balance while ignoring other aspects that are also relevant for making correct causal claims. Focusing only on minimizing mean differences, for example, could increase the maximum difference, which also matters. In short, the whole distribution is important, and focusing on one aspect risks worsening balance in some other aspect.

Sekhon's solution is to focus on the ultimate aim: we are not seeking balance for its own sake, but in order to draw correct causal conclusions from the data. This requires, he argues, at least balance on the first two moments of the distribution (expected value and variance). This leads him to maximize balance based on the Kolmogorov-Smirnov test.
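A small toy example of my own shows why mean differences alone can mislead: two covariate distributions with the same mean but different spread look balanced on the standardized difference, while the Kolmogorov-Smirnov statistic flags the remaining imbalance.

```r
# Same mean, different spread: mean-based balance looks fine, KS does not.
set.seed(2)
x_treated <- rnorm(1000, mean = 0, sd = 1)
x_control <- rnorm(1000, mean = 0, sd = 2)

(mean(x_treated) - mean(x_control)) / sqrt((var(x_treated) + var(x_control)) / 2)  # near zero
ks.test(x_treated, x_control)$statistic                                            # clearly larger than zero
```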

There are still many open questions about how to measure and maximize balance between groups of units with many different characteristics. Although theoretically interesting, one should perhaps invest some energy in determining how important the problems really are. Proving that a problem exists is far from proving that it creates serious problems in general. Sometimes it is even possible to prove that a problem is less important than one believed. For instance, Cochran showed that an analysis based on five sub-groups would eliminate roughly 90% of the bias. This surprising result may not have ended the discussion about how many groups to use, but it made the whole issue less important. The question of balance metrics is not analogous, but it would be interesting to run some simulations of how much the choice of balance metric matters. I suspect it does not make a big difference in most situations, but that does not, of course, make it uninteresting to find the small exceptions. Knowing when something might fail is a good thing, even if you are unlikely to encounter it.


Monday, April 16, 2012

Judea Pearl and causality

Judea Pearl gave a lecture in Oslo some time ago and I just want to digest it by writing this blog post.

His main argument was that causality is not a statistical concept. Statistical concepts are derived from the joint distribution. Causality, however, cannot be derived from a joint distribution (alone), since changing a variable will lead to a new joint distribution. If I double the price of a product, I cannot automatically use the old joint probability distribution to infer the effect of this change. It all depends on why there was a change, on other circumstances, and on the stability of the functions.

This is clearly true. Some may complain that not all changes lead to large changes in the joint distribution, that we may have prior knowledge that the relationship is often stable, and so on, but in principle it seems clear that we cannot get causality from observing the joint distribution alone. We need some kind of identifying assumption or mechanism, something to identify the effect of changes in a (structural) model: a randomized experiment, an instrumental variable, something!

Pearl's second argument was that standard statistical and mathematical language is unable to articulate the assumptions needed to talk about causality. We need new notation, he argued, to distinguish between equations and claims about causality. In his language y = 2x is an equation, while y := 2x or y <-- 2x symbolizes a causal claim. More generally, he argues that there are two approaches and languages that can be used. The first is structural equation modelling (which he approaches using arrows, graphs and do-operators). The second is the potential-outcomes language used by Neyman, Rubin and many others, in which the notation Y1 and Y0 indicates the value of the outcome when treated versus not treated.

So what, one might say? Is it so important to make the distinctions and notational investments above? Pearl has at least managed to show that the language is useful for generating surprising conclusions with important practical implications. For instance, his graphical approach and the "do-calculus" make it easier to identify when it is possible to estimate a causal effect and how it should be done. He has also shown, somewhat surprisingly, that conditioning on some types of variables ("colliders") in a standard regression analysis will introduce bias. Finally, using graphs and the do-calculus it is easier to focus attention on how to achieve identification of causal effects (by closing "back doors"). This is all very valuable.
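The collider result is easy to reproduce in a few lines of R. The sketch below is my own illustration, not Pearl's example: x has no effect on y, but both cause a third variable, and controlling for that variable in a regression manufactures an association out of nothing.

```r
# Collider bias: x and y are independent, but both cause "collider".
set.seed(3)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)                 # true effect of x on y is zero
collider <- x + y + rnorm(n)  # caused by both x and y

coef(lm(y ~ x))["x"]             # approximately 0, as it should be
coef(lm(y ~ x + collider))["x"]  # clearly negative: bias from conditioning on the collider
```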

The framework works, but it seems to lose at least some of its visual elegance when we introduce a time dimension with lags, feedbacks and dose-response relationships. Pearl's answer to this was that the approach is still valid, but that in a dynamic situation one has to imagine a succession of graphs.

In sum it was a stimulating talk. Some might argue that the approach is slightly too "either/or." Pearl makes a very sharp distinction between when it is possible and when it is impossible to identify a causal effect; there is no such thing as "approximate identification." This is mathematically true, but sometimes the important question is not "is it possible" but "how likely is it that I have gotten closer to a decision-relevant conclusion." To use an analogy: it is impossible to solve large NP problems fast, but it is possible to get an approximate solution fast. In the same way, I sometimes get the impression that Pearl's approach focuses heavily on "yes/no" questions as opposed to questions of degree ("how bad is it if x is the case compared to y when I try to identify a causal effect"). To be slightly more specific: using a collider in a regression is bad, yes, but how bad is it? And what determines the degree to which it produces misleading results? But these, I guess, are practical problems that come after logical and conceptual clarification.





Sims on Angrist and Pischke

Sims's comments (critical)

Also useful, partly as a contrast:

"Misunderstandings between experimentalists and observationalists about causal inference" by Imai, King and Stewart

Monday, September 26, 2011

How to do Mediation Analysis with Survival Data

I recently heard Theis Lange talk about How to do Mediation Analysis with Survival Data. Let's say you want to examine the effect of socio-economic position (SEP) on long-term sickness absence. Part of the effect may go through the physical work environment, and part may be directly related to the socio-economic position:



To make the idea even more specific, imagine a society in which farmers have a high socio-economic status (reducing the probability of becoming sick) but work in a dangerous physical environment (increasing the probability of becoming sick). The question is how much of the overall association between status and sickness absence goes through the physical environment and how much can be attributed directly to status.

Unfortunately, there is no general framework for answering these questions. There are approaches for specific models, but nothing that works for all cases. Lange's aim is to provide such a general framework. His approach is based on nested counterfactuals: we imagine SEP being held constant while varying the mediating variable, and we also imagine the mediating variable being held constant while changing the status variable. This sounds easy, but as so often it takes some effort to apply the idea. Lange makes a very useful contribution by showing exactly how to do it in R.
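To fix the idea (and only the idea), here is a toy base-R sketch of nested counterfactuals for a binary mediator and outcome. It is not Lange's estimator, it ignores the survival aspect entirely, and all the coefficients are invented for illustration.

```r
# Toy nested counterfactuals: sep = socio-economic position (1 = high),
# env = hazardous physical work environment (the mediator), as in the farmer example.
set.seed(4)
n <- 100000
p_env  <- function(sep) plogis(-1 + 1.5 * sep)                # high SEP -> more hazardous environment
p_sick <- function(sep, env) plogis(-2 - 0.5 * sep + 1 * env) # SEP protects directly; a bad environment harms

env0 <- rbinom(n, 1, p_env(0))  # the mediator as it would be under low SEP
env1 <- rbinom(n, 1, p_env(1))  # the mediator as it would be under high SEP

direct   <- mean(p_sick(1, env0)) - mean(p_sick(0, env0))  # change SEP, hold the mediator at its low-SEP value
indirect <- mean(p_sick(1, env1)) - mean(p_sick(1, env0))  # hold SEP high, let only the mediator change
c(direct = direct, indirect = indirect, total = direct + indirect)
```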

I have two small comments. First, the language of direct and indirect effects is slightly misleading. All effects are indirect in the sense that at a finer level of detail one could specify a more detailed causal mechanism. What we are really talking about is "effects going through the mediating mechanism" versus "all other effects going through other mediating mechanisms." But this is obvious.

More problematically, there may be a logical problem with nested counterfactuals, at least if one is not careful when interpreting the effects. Go back to the example of the farmer. Let's say we have many professions and we want to examine the relationship between having a job in a given profession (a proxy for socio-economic status) and sickness absence. Imagining a farmer in a farmer's environment is easy. Imagining a farmer in a non-farmer environment sounds difficult but not impossible for professions that are close, but imagining an economist working in a farmer's environment is implausible. Nested counterfactuals often involve implausible counterfactuals of this type. An approach that asks us to work out the effect of a mediating variable by averaging across some possible and some impossible counterfactuals may have a problem.

How much of a problem this is depends on how carefully one interprets the model. One may, for instance, interpret socio-economic status as something much more general. Imagine quadruplets from the same background: two with occupations of similar status (economist and dentist, perhaps) but with different work environments, and another two with professions of different status but with similar work environments. This may be logically possible, but it seems difficult to identify the associations in general.

In sum: good idea, great R implementation, but I remain uncertain about the intuition behind nested counterfactuals in many contexts.



UnderstandingSociety: Current issues in causation research

A blog post summarizing current strands in causation research - with many links to relevant papers.


Wednesday, September 7, 2011

Estimating a treatment effect using OLS: The problem of the implicit conditional-variance weighting

While reading Stephen L. Morgan and Christopher Winship's book Counterfactuals and Causal Inference, I came across the following statement:
“In general, regression models do not offer consistent estimates of the average treatment effect when causal heterogeneity is present.” (p. 148) 
The argument is that linear regression estimates implicitly use a weighting that is likely to be wrong. Consider an example in which the units can be divided into three strata (small, medium and large), and assume the effect of treatment differs between the strata: treating large units has a larger effect than treating small or medium units. Assume also that there are more medium units than small or large units. The overall average effect of treatment is the average of the treatment effects in the different strata, but since there are more of some kinds of units than others, the average must be weighted: if most units are medium, the result for that stratum should count for more when calculating the overall average effect. For each stratum, the treatment effect is weighted by the number of units it contains relative to the total number of units, and summing these weighted effects gives the average effect of treatment. This sounds both intuitive and correct.
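In code, this stratified calculation is just a weighted mean; the numbers below are invented for illustration.

```r
# Stratum-specific treatment effects weighted by the share of units in each stratum.
effects <- c(small = 1, medium = 2, large = 5)     # treatment effect within each stratum
n_units <- c(small = 200, medium = 600, large = 200)
weighted.mean(effects, w = n_units)                # overall average treatment effect: 2.4
```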

What happens if we use regression to estimate the treatment effect instead of the stratified weighting described above? Using the same data, one may run a linear regression with a treatment indicator that controls for the strata (i.e. includes dummy variables for each stratum except one). Interestingly, this kind of regression implicitly uses a different weighting than the stratified calculation described above. According to Morgan and Winship, the linear regression estimate implicitly includes the variance of the treatment variable within each stratum when constructing the weights (p. 144). If treatment is binary, this variance is p(1-p), where p is the probability that a unit receives treatment; it is highest when p = 0.5 (see figure below). In short, linear regression weights the strata not only by the number of units they contain but also by this conditional variance. The big question, of course, is whether this weighting procedure gives correct results.



According to Morgan and Winship, the answer is most likely no. For the regression to give correct results, either the propensity to receive treatment has to be the same in each stratum, or the stratum-specific effect must be the same across strata (p. 148). Morgan and Winship claim that the first condition is almost never true, because the very purpose of controlling for the strata in the regression is that they have different propensities to receive treatment. I have to admit I was not convinced by this, since the justification for including the strata could also be that the units respond differently once they get treatment, not that there is a difference in the propensity to receive treatment across strata. True, sometimes the problem is that different groups have different probabilities of receiving treatment, but not always? I may be mistaken here, but I do not want to just accept it right away. The second condition is more likely to be false: one often suspects that different strata respond differently to the same treatment. If this heterogeneity is captured by the model it does not create a problem, but if one uses a linear regression that does not capture it, one may end up with very wrong results.
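A small simulation of my own (not from the book) makes the weighting issue concrete: with equal-sized strata but different treatment propensities, OLS with stratum dummies lands near the p(1-p)-weighted average of the stratum effects rather than the simple average across strata.

```r
# Three equal-sized strata with different effects and different propensities.
set.seed(5)
n_per   <- 20000
stratum <- rep(c("small", "medium", "large"), each = n_per)
p_treat <- c(small = 0.1, medium = 0.5, large = 0.9)[stratum]  # propensity differs by stratum
effect  <- c(small = 1, medium = 2, large = 5)[stratum]        # treatment effect differs by stratum
d <- rbinom(length(stratum), 1, p_treat)
y <- effect * d + rnorm(length(stratum))

mean(c(1, 2, 5))                # simple average over equal-sized strata: about 2.67
coef(lm(y ~ d + stratum))["d"]  # OLS with stratum dummies: close to 2.42 instead

w <- c(0.1 * 0.9, 0.5 * 0.5, 0.9 * 0.1)  # p(1-p) within each stratum
sum(c(1, 2, 5) * w) / sum(w)             # implied conditional-variance weighting: about 2.42
```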

While accepting that this implicit weighting can create problems for regression estimates, I am slightly uncertain both about how often the problem occurs and about how serious it is when one of the conditions is not fulfilled. Morgan and Winship provide some helpful calculations showing that it is possible to get estimates that are very wrong. This is interesting, but the possibility does not demonstrate its generality.