Over the next few weeks I am going to read and comment on a series of papers about propensity score analysis. The comments are highly selective and do not represent a general evaluation of the articles. The first article is Dehejia, R.H. and S. Wahba (2002) Propensity score-matching methods for nonexperimental causal studies, Review of Economics and Statistics 84(1): 151-161.
Summary
This article is an introduction to propensity score matching, using the evaluation of work training programs as its example. Training programs are well suited to testing the method since there is a great deal of data available, including data from randomized experiments. One can then compare the results from the randomized experiment with an analysis based on propensity score adjustment using observational data. The authors also present detailed information about how different choices in the analysis affect the results (matching with or without replacement, varying the number of comparison units, and different matching methods).
If one simply compares average earnings among those who participated in the programs and those who did not, the participants had earnings about $8500 lower than the non-participants. This, of course, does not imply that the program had no effect, or a harmful one, since the comparison is biased: those who entered the program generally had more problems and lower earnings than the rest of the population. To find the causal effect one can conduct randomized experiments, and in this case the experiments indicate that on average the training increased annual earnings for those who participated by $1794.
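The selection problem can be made concrete with a small simulation. This is only a sketch under invented assumptions: the earnings distribution, the enrollment probabilities and the function name `simulate` are all hypothetical, and the only number taken from the article is the experimental estimate of $1794, which is used here as the "true" effect. It shows how a naive comparison can come out strongly negative even though the program helps everyone who takes it, while random assignment recovers the effect.

```python
import random
import statistics

TRUE_EFFECT = 1794  # experimental estimate quoted above, treated here as the true effect


def simulate(n=20000, seed=0):
    """Return (naive difference, randomized difference) in mean earnings.

    All distributional choices below are made up for illustration.
    """
    rng = random.Random(seed)
    self_treated, self_control = [], []
    rand_treated, rand_control = [], []
    for _ in range(n):
        baseline = rng.gauss(15000, 5000)  # hypothetical earnings without training

        # Observational world: low earners are far more likely to enroll.
        p_enroll = 0.8 if baseline < 10000 else 0.05
        if rng.random() < p_enroll:
            self_treated.append(baseline + TRUE_EFFECT)
        else:
            self_control.append(baseline)

        # Experimental world: a coin flip decides who gets the training.
        if rng.random() < 0.5:
            rand_treated.append(baseline + TRUE_EFFECT)
        else:
            rand_control.append(baseline)

    naive = statistics.mean(self_treated) - statistics.mean(self_control)
    randomized = statistics.mean(rand_treated) - statistics.mean(rand_control)
    return naive, randomized


naive, randomized = simulate()
print(f"naive difference:      {naive:8.0f}")
print(f"randomized difference: {randomized:8.0f}")
```

In this toy setup the naive difference is negative by several thousand dollars, purely because of who chooses to enroll, while the randomized difference lands close to 1794.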
Depending on the sample used and the choices made in the analysis, the propensity score analysis suggested that the effect was between -$916 and +$1928. This is a wide interval, but the main factor driving the differences in the results is the use of different samples, not the choice of matching method. Comparing the sample from the National Supported Work (NSW) Demonstration to data from the Current Population Survey (CPS) gives treatment effects between $1119 and $1605, while comparing it to data from the Panel Study of Income Dynamics (PSID) gives estimates from -$916 to +$1928. The main problem here is that when matching without replacement the number of usable observations becomes small and the results become unreliable.
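To see why the replacement choice matters when few comparison units resemble the treated, here is a minimal sketch of greedy nearest-neighbour matching on the propensity score. It is not the authors' exact procedure: the function names `match_nearest` and `mean_gap` are hypothetical, the scores are invented, and treated units are simply processed in the order given.

```python
def match_nearest(treated, controls, replace=True):
    """Greedy nearest-neighbour matching on a one-dimensional propensity score.

    Returns a list of (treated_index, control_index) pairs. Without
    replacement, each comparison unit can be used at most once.
    """
    available = set(range(len(controls)))
    pairs = []
    for i, score in enumerate(treated):
        if not available:
            break  # no comparison units left to match
        # Closest remaining comparison unit; ties broken by lowest index.
        j = min(available, key=lambda k: (abs(controls[k] - score), k))
        pairs.append((i, j))
        if not replace:
            available.remove(j)
    return pairs


def mean_gap(treated, controls, pairs):
    """Average absolute propensity score distance within matched pairs."""
    return sum(abs(treated[i] - controls[j]) for i, j in pairs) / len(pairs)


# Invented scores: four treated units with high propensity scores,
# but only one comparison unit that looks anything like them.
treated = [0.80, 0.78, 0.75, 0.70]
controls = [0.79, 0.40, 0.20, 0.10, 0.05]

with_repl = match_nearest(treated, controls, replace=True)
without_repl = match_nearest(treated, controls, replace=False)

print("with replacement:   ", with_repl, round(mean_gap(treated, controls, with_repl), 3))
print("without replacement:", without_repl, round(mean_gap(treated, controls, without_repl), 3))
```

With replacement, the single good comparison unit is reused for every treated unit, so the matches are close but rest on very little independent information. Without replacement, the later treated units are forced onto comparison units that are nothing like them. Either way, a thin region of overlap leaves few genuinely comparable observations.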
The importance of distinguishing between statistical and economic significance
The general idea of matching and propensity score analysis is to compare like-with-like. To test whether this is the case, many authors seem content with traditional statistical t-tests of whether the sub-groups being compared have similar covariates (age, health, education and so on). For instance, in the present article the authors note that "None of the differences between the matched group and the NSW sample are statistically significant" (p. 157).
It seems to me that this approach makes it too easy to conclude that the groups are similar enough to be compared. In traditional hypothesis testing the null hypothesis is that the groups are similar. Given that the statistical test is biased toward not rejecting the null hypothesis, it takes quite a big difference, especially in small samples, to conclude that the groups are different. In short, the playing field is not level. It becomes easy to claim that "I now compare like-with-like" when you require a lot of evidence to change your mind. At the conventional 5% level of significance, for instance, the tests will favour the null hypothesis of similarity unless the data provide strong evidence that the groups are different.
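A small numerical example illustrates the point. The ages below are invented, and the helper names `t_statistic` and `standardized_difference` are my own. The sketch computes the usual two-sample t statistic alongside the difference in means measured in pooled standard deviations, a scale that does not shrink toward "no difference" just because the sample is small.

```python
import math
import statistics


def t_statistic(a, b):
    """Welch two-sample t statistic for the difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def standardized_difference(a, b):
    """Difference in means expressed in pooled standard deviations."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Invented ages for six treated units and their six matched comparison units.
treated_age = [20, 24, 28, 33, 22, 29]
matched_age = [23, 27, 31, 36, 25, 32]

t = t_statistic(treated_age, matched_age)
d = standardized_difference(treated_age, matched_age)
print(f"t = {t:.2f}, standardized difference = {d:.2f}")
# prints: t = -1.07, standardized difference = -0.62
```

With 10 degrees of freedom the two-sided 5% critical value is about 2.23, so a t statistic of -1.07 is nowhere near significant. Yet the matched group is on average three years older, more than half a standard deviation. A t-test would report "no significant difference" here, while the standardized difference shows a gap that could well matter for earnings.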
What should be done? First of all, one should examine the covariates for potentially important differences even if they are not statistically significant. The problem, of course, is that this injects subjectivity into the process. Who is to judge which covariates are important to balance, and by how much? Still, this subjectivity is unavoidable, and it may even be desirable when one has prior information about the importance of some variables. The issue may also have implications for how to do propensity score analysis. So far it seems that people have used the kitchen-sink method of throwing lots of possible covariates into the analysis. Some of these are obviously important, while others are a priori less important and are thrown in "just to be sure." Given that the computer does not know the difference, it may be wrong to try equally hard to get balance on all covariates, and the kitchen-sink approach itself may be problematic.