Wednesday, April 18, 2012

Testing for balance: The power argument against t-tests is true, but not important and probably irrelevant??

Several authors have argued that it is wrong to use a standard t-test to determine whether there is a significant difference between the control group and the treatment group after matching. Imai has several papers on this, and Austin has collected information about how often the incorrect approach is used (here).

The core of the problem is simple: By deleting many observations from the sample one can easily make the matched groups look similar, since deleting observations reduces the power of the test. It is more difficult to find statistically significant differences in small groups because the standard error is bound to be large.

Although clearly true, I would argue that it is wrong to put a lot of emphasis on the power argument. It is unlikely to matter much in large samples. Reducing the number of observations in small samples greatly reduces the power, but in large samples it will not make much of a difference: the standard error of the difference shrinks roughly in proportion to 1/sqrt(n), so the problem fades quickly as the sample grows.
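To make the point concrete, here is a small simulation sketch (my own toy numbers, not taken from any of the papers mentioned): the same true imbalance in a covariate is easily detected by a t-test in a large sample but not in a small one.

```python
# A minimal sketch: the same true covariate imbalance between treated and
# control units is "insignificant" with few observations and clearly
# significant with many, because the standard error shrinks roughly as
# 1/sqrt(n). Numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_gap = 0.2  # a fixed, real difference in the covariate between groups

for n in (30, 300, 3000):  # observations per group after matching
    control = rng.normal(0.0, 1.0, size=n)
    treated = rng.normal(true_gap, 1.0, size=n)
    t_stat, p_value = stats.ttest_ind(treated, control)
    print(f"n per group = {n:4d}   p-value = {p_value:.3f}")

# Typically the p-value is large at n = 30 and tiny at n = 3000, even though
# the underlying imbalance is exactly the same.
```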

It is still correct, however, to be careful with standard t-tests in this context. A difference between the two groups in, for instance, gender balance could be important for the analysis even if it is not statistically significant.

More generally, one would like to have a criterion for determining what is meant by "the best possible balance." Austin in his article follows Ho and suggests that we should minimize the standardized difference. This is simply the difference between the averages in the treated and control groups divided by an estimate of the pooled standard deviation.
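For completeness, a minimal sketch of that calculation (the function name and the 0.1 rule of thumb below are common conventions of mine and others, not something prescribed by Austin's article in particular):

```python
# A minimal sketch of the standardized difference for a continuous covariate:
# the difference in group means divided by a pooled standard deviation.
# Variable and function names are my own choices.
import numpy as np

def standardized_difference(treated, control):
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd

# A standardized difference below roughly 0.1 is often treated as "balanced",
# though that cut-off is a convention rather than a theorem.
```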

Not only would we like to have such a criterion, we would also like to be able to implement it easily in a computer program. Unfortunately, it turns out that overall balance is not simply a monotonic function of the weights on the variables. This has led some, like Sekhon, to try genetic algorithms (here), which seem to fit well with these kinds of problems.
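As a rough illustration of the underlying idea (this is not Sekhon's GenMatch itself, just a crude stand-in where random sampling plays the role of the genetic algorithm), one can search over covariate weights in a matching distance and score each candidate by the balance it produces:

```python
# A rough sketch: try many weight vectors for a weighted matching distance,
# match treated units to their nearest controls under each, and keep the
# weights that give the best balance (here: smallest worst-case standardized
# mean difference). Data and numbers are invented for illustration.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
n_t, n_c, k = 50, 200, 3
X_t = rng.normal(0.3, 1.0, (n_t, k))   # treated covariates (toy data)
X_c = rng.normal(0.0, 1.0, (n_c, k))   # control covariates (toy data)

def balance_after_matching(weights):
    """Match each treated unit to its nearest control under a weighted
    Euclidean distance, then return the worst absolute standardized
    mean difference across covariates (smaller is better)."""
    d = cdist(X_t * weights, X_c * weights)
    matched_c = X_c[d.argmin(axis=1)]
    diff = X_t.mean(axis=0) - matched_c.mean(axis=0)
    pooled_sd = np.sqrt((X_t.var(axis=0, ddof=1) + matched_c.var(axis=0, ddof=1)) / 2)
    return np.abs(diff / pooled_sd).max()

best = min((rng.uniform(0.1, 10, k) for _ in range(200)), key=balance_after_matching)
print("best weights found:", best, " worst SMD:", balance_after_matching(best))
```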

Sekhon also argues that the choice of a balance metric should not be based only on descriptive metrics (like standardized mean differences). The argument is that such criteria can lead the algorithm to focus too much on one aspect of balance while ignoring other aspects that are also relevant for making correct causal claims. Focusing only on minimizing mean differences, for example, could increase maximum differences, which are also relevant. In short, the whole distribution is important, and focusing on one aspect risks worsening balance in some other aspect.

Sekhon's solution is to focus on the ultimate aim: We are not seeking balance for its own sake, but in order to draw correct causal conclusions from the data. This requires, he argues, at least balance on the first two moments of the distribution (expected value and variance). This leads him to maximize balance based on the Kolmogorov-Smirnov test.
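A small illustration of why the mean difference alone can be misleading (constructed data, purely for illustration): the two samples below have the same mean but different spreads, so the mean difference looks fine while the Kolmogorov-Smirnov statistic flags the imbalance.

```python
# A minimal sketch of checking balance on a whole covariate distribution
# with the Kolmogorov-Smirnov statistic rather than the mean difference alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(0.0, 1.0, 1000)
treated = rng.normal(0.0, 2.0, 1000)   # same mean, larger variance

print("difference in means:", treated.mean() - control.mean())   # near zero
print("KS statistic:", stats.ks_2samp(treated, control).statistic)  # clearly not zero
```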

There are still a lot of questions about how to measure and maximize balance between groups of units with many different characteristics. Although theoretically interesting, one should, perhaps, invest some energy in determining how important the problems are. Proving that a problem exists is far from proving that it creates serious problems in general. Sometimes it is even possible to prove that a problem is less important than one believed. For instance, Cochran showed that an analysis based on five sub-groups would remove roughly 90% of the bias. This surprising result may not end the discussion about how many groups to use, but it made the whole issue less important. The question of balance metrics is not analogous, but it would be interesting to do some simulations of the importance of using different balance metrics. I suspect they do not make a big difference in most situations, but this does not, of course, imply that it is uninteresting to find the small exceptions. Knowing when something might fail is a good thing, even if you are unlikely to encounter it.


Monday, April 16, 2012

Judea Pearl and causality

Judea Pearl gave a lecture in Oslo some time ago and I just want to digest it by writing this blog post.

His main argument was that causality is not a statistical concept. Statistical concepts are derived from the joint distribution. Causality, however, cannot be derived from a joint distribution (alone), since changing a variable will lead to a new joint distribution. If I double the price of a product, I cannot automatically use the old joint probability distribution to infer the effect of this change. It all depends on why there was a change, on other circumstances, and on the stability of the functions.

This is clearly true. Some may object that not all changes will lead to large changes in the joint distribution, that we may have prior knowledge that the relationship is often stable, and so on, but in principle it seems clear that we cannot get causality from observing the joint distribution alone. We need some kind of identifying assumption or mechanism. We need something to identify the effect of changes in a (structural) model: a randomized experiment, an instrumental variable, something!
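A toy construction (my own example, not Pearl's) makes the point concrete: two data-generating processes can produce the same joint distribution of (X, Y) and still respond differently to an intervention on X.

```python
# Two structural models with the same observational joint distribution but
# different responses to intervening on X. Numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Model A: X causes Y.                 Model B: Y causes X.
xa = rng.normal(0, 1, n)
ya = 0.5 * xa + rng.normal(0, np.sqrt(0.75), n)
yb = rng.normal(0, 1, n)
xb = 0.5 * yb + rng.normal(0, np.sqrt(0.75), n)

# Observationally the two look the same (same means, variances, correlation).
print("corr A:", np.corrcoef(xa, ya)[0, 1], " corr B:", np.corrcoef(xb, yb)[0, 1])

# Intervene: set X = 2 for everyone and regenerate Y from each structural model.
y_do_a = 0.5 * 2 + rng.normal(0, np.sqrt(0.75), n)   # Y still responds to X
y_do_b = yb                                          # Y is unaffected by X
print("E[Y | do(X=2)] under A:", y_do_a.mean(), " under B:", y_do_b.mean())
```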

Pearl's second argument was that standard statistical and mathematical language is unable to articulate the assumptions needed to talk about causality. We need new notation, he argued, to distinguish between equations and claims about causality. In his language y = 2x is an equation, while y := 2x or y <-- 2x symbolizes a causal claim. More generally, he argues that there are two approaches and languages that can be used. The first is structural equation modelling (which he approaches using arrows, graphs and do-operators). The second is the potential-outcome language used by Neyman, Rubin and many others. In this language the notation Y1 and Y0 is used to indicate the value of the outcome when treated vs. not treated.
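My own compressed summary of the two notations, using standard symbols (textbook definitions, not formulas taken from the talk):

```latex
% "Seeing" is not "doing": conditioning on X = x and intervening to set X = x
% are, in general, different quantities in Pearl's notation.
\[
  P(Y \mid X = x) \;\neq\; P(Y \mid \operatorname{do}(X = x))
\]
% In the Neyman-Rubin potential-outcome notation, Y_1 and Y_0 are the outcomes
% with and without treatment, and the usual target is the average treatment effect:
\[
  \mathrm{ATE} = E[\,Y_1 - Y_0\,]
\]
```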

So what, one might say? Is it so important to make the distinctions and notational investments above? Pearl has at least managed to show that the language is useful for generating surprising conclusions that have important practical implications. For instance, his graphical approach and the "do-calculus" make it easier to identify when it is possible to estimate a causal effect and how it should be done. He has also shown, somewhat surprisingly, that conditioning on some types of variables ("colliders") in a standard regression analysis will introduce bias. Finally, using graphs and the do-calculus it is easier to focus attention on how to achieve identification of causal effects (by closing "back doors"). This is all very valuable.
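The collider result is easy to check by simulation (a toy example of my own, not one Pearl used): generate X and Y independently, let C be caused by both, and compare a regression of Y on X with and without C included.

```python
# Collider bias in a toy simulation: X and Y are independent by construction,
# so regressing Y on X finds (almost) nothing, but adding the collider
# C = X + Y + noise as a control induces a spurious association.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)            # independent of x by construction
c = x + y + rng.normal(size=n)    # a collider: caused by both x and y

def slope_on_x(regressors, target):
    """Return the OLS coefficient on x (the first regressor), with an intercept."""
    design = np.column_stack([np.ones(len(target))] + regressors)
    return np.linalg.lstsq(design, target, rcond=None)[0][1]

print("slope of y on x:         ", slope_on_x([x], y))      # approximately 0
print("slope of y on x, given c:", slope_on_x([x, c], y))    # clearly non-zero (spurious)
```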

The framework works, but it seems to lose at least some of its visual elegance when we introduce a time dimension with lags, feedbacks and dose-response relationships. Pearl's answer to this was that the approach is still valid, but that in a dynamic situation one would have to imagine a succession of graphs.

In sum it was a stimulating talk. Some might argue that the approach is slightly too much "either/or." Pearl makes a very sharp distinction between when it is possible and when it is impossible to identify a causal effect. There is no such thing as "approximate identification." This is clearly true mathematically, but sometimes the important question is not "is it possible?" but "how likely is it that I have gotten closer to a decision-relevant conclusion?" To use an analogy: we cannot expect to solve large NP-hard problems exactly and quickly, but we can often get an approximate solution quickly. In the same way I sometimes get the impression that Pearl's approach focuses heavily on "yes/no" questions as opposed to questions of degree ("how bad is it if x is the case compared to y when I try to identify a causal effect?"). To be slightly more specific: conditioning on a collider in a regression is bad, yes, but how bad is it? And what determines the degree to which it produces misleading results? But these, I guess, are practical problems that come after logical and conceptual clarification.





Sims on Angrist and Pischke

Sims' comments (critical)

Also useful, partly as a contrast:

"Misunderstandings between experimentalists and observationalists about causal inference" by Imai, King and Stewart