Monday, September 26, 2011

How to do Mediation Analysis with Survival Data

I recently heard Theis Lange talk about How to do Mediation Analysis with Survival Data. Let's say you want to examine the effect of socio-economic position (SEP) on long-term sickness absence. Part of the effect may go through the physical work environment, and part may be directly related to the socio-economic position:



To make the idea even more specific, imagine a society in which farmers have a high socio-economic status (reducing the probability of becoming sick) but work in a dangerous physical environment (increasing the probability of becoming sick). The question is how much of the overall association between status and sickness absence goes through the physical environment and how much can be attributed directly to status.

Unfortunately there is no general framework for answering these questions. There are some approaches for specific models, but nothing that works for all cases. Lange's aim is to provide such a general framework. His approach is based on nested counterfactuals: we imagine holding SEP constant while varying the mediating variable, and we also imagine holding the mediating variable constant while changing the status variable. This sounds easy but, as is often the case, it takes some effort to apply the idea. Lange makes a very useful contribution by showing exactly how to apply it in R.
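To fix ideas, here is a sketch of how nested counterfactuals are usually written in the mediation literature. The notation is my assumption, not necessarily the one used in the talk: A is the exposure (SEP), M the mediator (work environment), M(a) the value the mediator would take under exposure a, and Y(a, m) the outcome under exposure a and mediator value m. With survival data the contrasts would typically be formed on a hazard scale rather than as differences in means, so this should be read as the general shape of the idea only.

```latex
\begin{align*}
\text{total effect} &= E[Y(a, M(a))] - E[Y(a^*, M(a^*))] \\
\text{natural direct effect} &= E[Y(a, M(a^*))] - E[Y(a^*, M(a^*))] \\
\text{natural indirect effect} &= E[Y(a, M(a))] - E[Y(a, M(a^*))]
\end{align*}
```

The direct and indirect effects add up to the total effect. The term Y(a, M(a*)) is the nested part: the exposure is set to one level while the mediator takes the value it would have had under the other level.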

I have two small comments. First of all, the language of direct and indirect effects is slightly misleading. All effects are indirect in the sense that at a finer level of detail one could specify a more detailed causal mechanism. What we are really talking about is "effects going through the mediating mechanism" vs "all other effects that go through other mediating mechanisms." But this is obvious.

More problematic, there may be a logical problem with nested models, at least if one is not careful when interpreting the effects. Go back to the example of the farmer. Let's say we have lots of professions and we want to examine the relationship between having a job in a given profession (which is a proxy for socio-economic status) and sickness absence. Imagining a farmer in a farmer's environment is easy. Imagining a farmer in a non-farmer environment sounds difficult but not impossible for professions that are close, but imagining an economist working in a farmer's environment is implausible. Nested counterfactuals often involve implausible counterfactuals of this type. An approach that asks us to work out the effect of a mediating variable by taking the average across some possible and some impossible counterfactuals may have a problem.

To what extent this is a problem depends on how careful one is when interpreting the model. One may, for instance, interpret socio-economic status as something much more general. Imagine quadruplets from the same background: two with occupations of similar status (economist and dentist, perhaps) but with different work environments, and another two with professions of different status but similar work environments. This may be logically possible, but it seems difficult to identify the associations in general.

In sum: good idea, great R implementation, but I am uncertain about the intuition behind nested models in many contexts.



UnderstandingSociety: Current issues in causation research

A blog post summarizing current strands in causation research - with many links to relevant papers.


Wednesday, September 7, 2011

Estimating a treatment effect using OLS: The problem of the implicit conditional-variance weighting

While reading Stephen L. Morgan and Christopher Winship's book Counterfactuals and Causal Inference, I came across the following statement:
“In general, regression models do not offer consistent estimates of the average treatment effect when causal heterogeneity is present.” (p. 148) 
The argument is that linear regression estimates implicitly use a weighting that is likely to be wrong. Consider an example in which the units can be divided into three strata (small, medium and large). Assume the effect of treatment differs between the strata: treating large units has a larger effect than treating small and medium units. We also assume that there are more medium units than small and large units. The overall average effect of treatment is the average of the treatment effects in the different strata. Since there may be more of some units than others, the overall average must be weighted. If most units are medium, then the results for this group should be given more weight when calculating the overall average effect. For each stratum the treatment effect is weighted by the number of units it contains relative to the overall number of units, and summing these gives the average effect of treatment. This sounds both intuitive and correct.

What happens if we use regression to estimate the treatment effect instead of the stratified weighting described above? Using the same data one may try a linear regression to find the effect of treatment in a model that controls for the strata (i.e. one includes dummy variables for each stratum except for one). Interestingly, this kind of regression implicitly uses a different weighting than the stratified matching described above. According to Morgan and Winship, the linear regression estimate implicitly includes the variance of the treatment variable within each stratum when constructing the weight (p. 144). If treatment is binary, the variance is p(1-p), where p is the probability that a unit will receive treatment. The variance is highest when p=0.5 (see figure below). In short, linear regression does not only use the number of units in the strata to weight the results, but also the conditional variance. The big question, of course, is whether this weighting procedure gives correct results.
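The difference between the two weightings can be checked with a small numeric sketch. Everything in it is made up for illustration: the stratum sizes, treatment shares and effects are hypothetical, the outcomes are noise-free, and the function names are my own. The regression coefficient is computed by demeaning treatment and outcome within each stratum, which by the Frisch-Waugh-Lovell theorem gives the same treatment coefficient as a regression with stratum dummies.

```python
# Each stratum is (number of units, share treated, stratum-specific effect).
# All numbers are hypothetical and chosen only to make the contrast visible.

def size_weighted_ate(strata):
    """Average treatment effect: stratum effects weighted by stratum size."""
    total = sum(n for n, _, _ in strata)
    return sum(n / total * delta for n, _, delta in strata)


def variance_weighted_average(strata):
    """Stratum effects weighted by n * p * (1 - p), the implicit OLS weights."""
    weights = [n * p * (1 - p) for n, p, _ in strata]
    effects = [delta for _, _, delta in strata]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def ols_with_stratum_dummies(strata):
    """Treatment coefficient from OLS of y on treatment plus stratum dummies.

    Builds noise-free data and uses within-stratum demeaning, which yields
    the same coefficient as including the dummies (Frisch-Waugh-Lovell).
    """
    num = 0.0
    den = 0.0
    for s, (n, p, delta) in enumerate(strata):
        n_treated = round(n * p)
        d = [1.0] * n_treated + [0.0] * (n - n_treated)
        # Stratum-specific baseline (10 * s) plus the stratum's own effect.
        y = [10.0 * s + delta * di for di in d]
        d_bar = sum(d) / n
        y_bar = sum(y) / n
        num += sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
        den += sum((di - d_bar) ** 2 for di in d)
    return num / den


# Small, medium and large units; treatment propensity differs by stratum.
strata = [(200, 0.1, 1.0), (600, 0.5, 2.0), (200, 0.9, 6.0)]
ate = size_weighted_ate(strata)
ols = ols_with_stratum_dummies(strata)
implied = variance_weighted_average(strata)
print(f"size-weighted ATE:         {ate:.4f}")
print(f"OLS treatment coefficient: {ols:.4f}")
print(f"variance-weighted average: {implied:.4f}")

# Same sizes and effects, but equal treatment propensity in every stratum.
strata_equal_p = [(200, 0.5, 1.0), (600, 0.5, 2.0), (200, 0.5, 6.0)]
ols_equal_p = ols_with_stratum_dummies(strata_equal_p)
print(f"OLS with equal propensity: {ols_equal_p:.4f}")
```

With these made-up numbers the size-weighted average effect is 2.6, while the regression coefficient is about 2.29 and coincides with the variance-weighted average. The medium stratum, where p=0.5, gets extra weight. When the propensity is set to 0.5 in every stratum, the regression coefficient returns to 2.6.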



According to Morgan and Winship, the answer is most likely no. In order to give correct results, either the propensity to receive treatment has to be the same in each stratum or the stratum-specific effect must be the same for each stratum (p. 148). Morgan and Winship claim that the first is almost never true because the purpose of controlling for the strata in the regression is that they have different propensities to get the treatment. I have to admit I was not convinced by this, since the justification for including the strata could be that the units respond differently once they get treatment, not that there is a difference in the propensity to get treatment in the different strata. True, sometimes the problem is that different groups have different probabilities of receiving treatment, but surely not always? I may be mistaken here, but I do not want to just accept it right away. The second is more likely to be false: one often suspects that different strata respond differently to the same treatment. If this is captured by the model, it does not create a problem, but if one uses a linear regression and there is non-linear heterogeneity, one may end up with very wrong results.
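The two conditions can be read directly off the weights. The following is a sketch in my own notation, assuming a binary treatment and a regression of the outcome on treatment and stratum dummies: n_s is the number of units in stratum s, N the total number of units, p_s the share treated in stratum s, and \delta_s the stratum-specific effect.

```latex
\begin{align*}
\delta_{\text{ATE}} &= \sum_s \frac{n_s}{N}\,\delta_s \\
\delta_{\text{OLS}} &= \sum_s \frac{n_s\,p_s(1-p_s)}{\sum_r n_r\,p_r(1-p_r)}\,\delta_s \\
p_s = p \text{ for all } s \;&\Rightarrow\; \frac{n_s\,p(1-p)}{p(1-p)\sum_r n_r} = \frac{n_s}{N}
  \;\Rightarrow\; \delta_{\text{OLS}} = \delta_{\text{ATE}} \\
\delta_s = \delta \text{ for all } s \;&\Rightarrow\; \delta_{\text{OLS}} = \delta_{\text{ATE}} = \delta
\end{align*}
```

If the propensity is the same everywhere, the variance term cancels and only the stratum sizes remain. If the effect is the same everywhere, any set of weights that sums to one returns that common effect.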

While accepting that this implicit weighting can create problems for regression estimates, I am slightly more uncertain both about how often the problem will occur and about how serious it will be when neither of the conditions is fulfilled. Morgan and Winship provide some helpful calculations showing that it is possible to get estimates that are very wrong. This is interesting, but showing that the problem is possible does not demonstrate that it is general.