Several authors have argued that it is wrong to use a standard t-test to determine whether there is a significant difference between the control group and the treatment group after matching. Imai has several papers about this, and Austin has collected information about how often the incorrect approach is used (here).
The core of the problem is simple: by deleting many observations from the sample, one can easily make the matched groups look similar, because this reduces the power of the test. It is more difficult to find statistically significant differences in small groups because the standard error is bound to be large.
Although this is clearly true, I would argue that it is wrong to put a lot of emphasis on the power argument. It is unlikely to matter much in large samples. Reducing the number of observations in small samples greatly reduces the power, but in large samples it will not make much of a difference. The problem shrinks as the number of observations grows, since the standard error falls with the square root of the sample size.
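To make the power argument concrete, here is a minimal sketch of how the standard error of a difference in means changes when matching discards observations. The function name, the common standard deviation of 1.0, and the sample sizes are illustrative assumptions, not taken from any of the papers mentioned.

```python
import math

def se_diff_means(sd, n_treated, n_control):
    """Standard error of a difference in means, assuming both groups
    share the same standard deviation `sd` (an assumption for illustration)."""
    return sd * math.sqrt(1.0 / n_treated + 1.0 / n_control)

# Hypothetical scenario: matching throws away half of each group.
for n in (40, 400, 4000):
    before = se_diff_means(1.0, n, n)
    after = se_diff_means(1.0, n // 2, n // 2)
    print(f"n per group {n}: SE before {before:.4f}, after {after:.4f}")
```

The relative increase in the standard error is the same in every row, but in the large samples the standard error stays small in absolute terms even after half the observations are dropped.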
It is still correct, however, to be careful with standard t-tests in this context. A difference between the two groups in, for instance, gender balance, could be important in the analysis even if it is not statistically significant.
More generally, one would like to have a criterion for determining what is meant by "the best possible balance." Austin, in his article, follows Ho and suggests that we should minimize the standardized difference. This is simply the difference between the average in the control group and the average in the treated group, divided by an estimate of the pooled standard deviation.
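As a rough sketch, the standardized difference for a single covariate could be computed as below. The pooled standard deviation is taken here as the square root of the average of the two sample variances, which is one common choice and an assumption on my part; the sign convention (treated minus control) and the function name are also assumptions.

```python
import math
from statistics import mean, variance

def standardized_difference(treated, control):
    """Difference in group means divided by a pooled standard deviation.

    Pooling rule (an assumption): sqrt of the mean of the two sample variances.
    """
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical covariate values for two small matched groups.
print(standardized_difference([2, 4, 6], [1, 3, 5]))
```

Unlike a t-statistic, this quantity does not grow mechanically with the sample size, so discarding observations does not by itself make the groups look more similar.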
Not only would we like to have such a criterion, we would also like to be able to implement it easily in a computer program. Unfortunately, it turns out that the overall balance is not simply a monotonic function of the weights on the variables. This has led some, like Sekhon, to try genetic algorithms (here), which seem to fit well with these kinds of problems.
Sekhon also argues that the choice of a balance metric should not be based only on descriptive metrics (like standardized mean differences). The argument is that such criteria can lead the algorithm to focus too much on one aspect of balance while leaving out other aspects that are also relevant for making correct causal claims. Focusing only on minimizing mean differences, for example, could lead to an increase in maximum differences, which are also relevant. In short, the whole distribution is important, and focusing on one aspect risks worsening balance in some other aspect.
Sekhon's solution is to focus on the ultimate aim: we are not seeking balance for its own sake, but in order to draw correct causal conclusions from the data. This requires, he argues, at least balance on the first two moments of the distribution (expected value and variance). This leads him to maximize balance based on the Kolmogorov-Smirnov test.
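To illustrate why a distribution-wide measure can pick up what a mean comparison misses, here is a small sketch of the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. This is not Sekhon's implementation; the function name and the toy data are made up for illustration.

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # share of x at or below v
        fy = bisect_right(ys, v) / len(ys)  # share of y at or below v
        d = max(d, abs(fx - fy))
    return d

# Hypothetical groups with identical means but different spread.
treated = [1, 5, 9]
control = [4, 5, 6]
print(ks_statistic(treated, control))
```

The two toy groups have the same mean, so their mean difference is zero, yet the statistic is positive because the distributions differ in spread.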
There are still a lot of questions about how to measure and maximize balance between groups of units with many different characteristics. Although these questions are theoretically interesting, one should, perhaps, invest some energy in determining how important the problems are. Proving that a problem exists is far from proving that it creates serious problems in general. Sometimes it is even possible to prove that a problem is less important than one believed. For instance, Cochran showed that an analysis based on five subgroups would eliminate about 90% of the bias. This surprising result may not end the discussion about how many groups to use, but it made the whole issue less important. The question of balance metrics is not analogous, but it would be interesting to do some simulations of the importance of using different balance metrics. I suspect they do not make a big difference in most situations, but this does not, of course, imply that it is uninteresting to find the small exceptions. Knowing when something might fail is a good thing, even if you are unlikely to encounter it.