Lit Review - Dawid (2008) Beware of the DAG!


The author looks at the current causal inference/discovery methods of Pearl and Spirtes from a philosophical perspective.

One quite interesting thing the author mentions is the difference between “seeing” and “doing” that Pearl brought up in his book Causality. “Seeing” and “doing” are definitely different, but from a philosophical standpoint, you don’t know how a system will behave until you actually “do” something to it, and inferring the effects of “doing” from observational “seeing” is questionable. Pearl’s approach of inferring “doing” from “seeing” might need more justification, or strong assumptions have to be made.

The author also mentions that even though causal discovery algorithms were developed to automate the process of discovering causal connections, they are not truly automatic, because the assumptions and results have to be clearly justified using domain knowledge.


Study Note - Causal Inference in Statistics: A Primer


This post is a review of the book Causal Inference in Statistics: A Primer (Pearl, Glymour, and Jewell, 2016).

Under Pearl’s framework, causal inference for interventions can be carried out using the do-calculus. For example, if we want to measure the average causal effect (ACE), we usually want to estimate

$$P(Y=1 \mid do(X=1)) - P(Y=1 \mid do(X=0))$$

Therefore, we would like to estimate the distribution

$$P(Y \mid do(X=x))$$

\(P(Y \mid do(X=x))\) is the distribution under the manipulated model, which removes all arrows pointing into $X$ from its parents. Equivalently, it can be computed by stratifying over a set of variables $Z$ that meets the back-door criterion, i.e., \(P(Y=y \mid do(X=x)) = \sum_{z}P(Y=y \mid X=x, Z=z)P(Z=z)\). This adjustment formula works because the conditional probabilities on the right-hand side are unchanged by the manipulation, so they can be estimated directly from observational data, and hence so can the causal effect.
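As a sanity check on the adjustment formula, here is a minimal sketch in Python on simulated data; the data-generating process, variable names, and numbers are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy observational data: binary confounder Z, treatment X, outcome Y.
# (Hypothetical numbers, chosen only to illustrate the adjustment formula.)
rng = np.random.default_rng(0)
n = 100_000
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)            # Z influences treatment
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)  # Z and X influence outcome
df = pd.DataFrame({"Z": z, "X": x, "Y": y})

def p_y1_do_x(df, x_val):
    """Back-door adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for z_val, p_z in df["Z"].value_counts(normalize=True).items():
        cell = df[(df["X"] == x_val) & (df["Z"] == z_val)]
        total += cell["Y"].mean() * p_z
    return total

naive = df.loc[df["X"] == 1, "Y"].mean() - df.loc[df["X"] == 0, "Y"].mean()
ace = p_y1_do_x(df, 1) - p_y1_do_x(df, 0)
print(f"naive difference: {naive:.3f} (confounded)")
print(f"adjusted ACE:     {ace:.3f} (true value 0.3 by construction)")
```

The naive difference in means overstates the effect because $Z$ opens a back-door path; stratifying on $Z$ recovers the causal effect.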

The front-door criterion gives another way of calculating the causal effect of $X$ on $Y$, via a mediating variable $Z$ that is not affected by the confounders. Using the back-door criterion (with $X$ as the adjustment variable), we can estimate the causal effect of $Z$ on $Y$. We can also estimate the causal effect of $X$ on $Z$ directly, because $Z$ is a mediator and shares no confounders with $X$. When we chain the causal effect of $X$ on $Z$ with the causal effect of $Z$ on $Y$, we get the causal effect of $X$ on $Y$.
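Chaining the two steps yields the standard front-door adjustment formula (this is the textbook result, written out here since my notes above only describe it verbally):

$$P(y \mid do(x)) = \sum_z P(z \mid x)\sum_{x^\prime} P(y \mid x^\prime, z)\,P(x^\prime)$$

The inner sum is the back-door adjustment for the effect of $Z$ on $Y$ (adjusting for $X$), and the outer sum chains it with $P(z \mid x)$, the effect of $X$ on $Z$.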

The graphical models also allow us to use the truncated product rule to calculate the post-intervention (manipulated) distribution. For example:

$$P(z, y \mid do(x)) = P_m(z)\,P_m(y \mid x, z) = P(z)\,P(y \mid x, z)$$

and

$$P(y \mid do(x)) = \sum_z P(z)\,P(y \mid x, z)$$
The adjustment formula is straightforward to use when estimating causal effects. However, the adjustment procedure can run into practical issues: as we adjust for many variables in $Z$, the number of data points within each $Z=z$ cell might be too small to reliably estimate the conditional probabilities.

The book also discusses another, more subtle procedure that overcomes this practical difficulty of adjustment: inverse probability weighting. If $P(X=x \mid Z=z)$ is available to us, we can use it to reweight the observed samples so that they act as though they were drawn from the post-intervention distribution $P_m$, rather than from $P(x, y, z)$.
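Here is a minimal sketch of inverse probability weighting on the same style of toy data as above; again, everything about the setup is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data where Z confounds both treatment X and outcome Y.
rng = np.random.default_rng(1)
n = 100_000
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)
df = pd.DataFrame({"Z": z, "X": x, "Y": y})

# Propensity score P(X=1 | Z=z), estimated here by stratifying on Z.
ps = df.groupby("Z")["X"].transform("mean")
# Probability of the treatment each subject actually received.
p_received = np.where(df["X"] == 1, ps, 1 - ps)

# Weighting each observation by 1 / P(X=x_i | Z=z_i) makes the sample behave
# as though it were drawn from the post-intervention distribution P_m.
w = 1.0 / p_received
treated = (df["X"] == 1).to_numpy()
ipw_ace = (np.average(df["Y"][treated], weights=w[treated])
           - np.average(df["Y"][~treated], weights=w[~treated]))
print(f"IPW ACE estimate: {ipw_ace:.3f} (true value 0.3 by construction)")
```

Note that no $Z=z$ cell has to be dense enough to estimate $P(Y \mid X, Z)$ directly; only the propensity score $P(X \mid Z)$ is needed.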

MCMC - Gibbs Sampling

|

Study Notes for Coursera Course: Bayesian Statistics: Techniques and Models by Matthew Heiner.

Compared to Metropolis-Hastings (MH), which in the course was used to sample a single parameter, Gibbs Sampling (GS) can sample multiple parameters, and it does this one parameter at a time. The important derivation to keep in mind is the following:

$$\begin{aligned}
p(\theta, \phi \mid y) &\propto g(\theta, \phi) \\
p(\theta, \phi \mid y) &= p(\phi \mid y)\,p(\theta \mid \phi, y) \\
p(\theta \mid \phi, y) &\propto p(\theta, \phi \mid y) \propto g(\theta, \phi) \\
p(\phi \mid \theta, y) &\propto p(\theta, \phi \mid y) \propto g(\theta, \phi)
\end{aligned}$$

We sample one parameter at a time from its full conditional distribution, treating the other parameters as constants.
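To make this concrete, here is a minimal Gibbs sampler sketch for a conjugate normal model; the model, prior values, and synthetic data are my own example rather than the course’s code:

```python
import numpy as np

# Gibbs sampler for y_i ~ N(mu, sig2) with priors
# mu ~ N(m0, s0^2) and sig2 ~ Inverse-Gamma(a, b).
# Conjugacy gives both full conditionals in closed form.
rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=200)  # synthetic data
n, ybar = len(y), y.mean()
m0, s02, a, b = 0.0, 100.0, 2.0, 1.0          # hypothetical prior values

n_iter = 5000
mu, sig2 = 0.0, 1.0                           # arbitrary initial values
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    # Draw mu | sig2, y: precision-weighted combination of prior and data.
    prec = 1.0 / s02 + n / sig2
    mean = (m0 / s02 + n * ybar / sig2) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # Draw sig2 | mu, y ~ Inverse-Gamma(a + n/2, b + sum((y - mu)^2) / 2).
    sig2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum((y - mu) ** 2)))
    samples[i] = mu, sig2

burn = 1000  # discard warm-up draws
print("posterior mean of mu  :", samples[burn:, 0].mean())
print("posterior mean of sig2:", samples[burn:, 1].mean())
```

Each update conditions on the current value of the other parameter, exactly as in the derivation above.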


MCMC - Metropolis-Hastings


Study Notes for Coursera Course: Bayesian Statistics: Techniques and Models by Matthew Heiner.

Bayesian statistics allows us to calculate the posterior distribution from the likelihood function and the prior distribution. The posterior is proportional to the likelihood times the prior, as expressed by the following equation:

$$p(\theta \mid y) \propto g(\theta)$$

Where $p(\theta \mid y)$ is the posterior and $g(\theta)$ is the unnormalized product of likelihood and prior. The reason we need algorithms like Metropolis-Hastings is that in more sophisticated settings the denominator (the normalizing constant) is difficult to integrate, but we still want to be able to estimate statistics of the posterior distribution, e.g., its mean and variance.

The Metropolis-Hastings algorithm uses a proposal distribution and draws candidate samples $\theta^\prime$ from it:

$$\theta^\prime \sim q(\theta^\prime \mid \theta_{i-1})$$

This notation shows that the draws form a Markov chain: the probability of drawing a specific $\theta^\prime$ depends on the previous draw $\theta_{i-1}$. Based on an accept/reject criterion, the accepted $\theta^\prime$ values give us simulated samples from the posterior distribution itself, not just values scattered around its mean; after convergence, the sample mean of the draws approximates the mean of the posterior distribution.
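For reference, the standard accept/reject rule is: having drawn $\theta^\prime$, accept it with probability

$$\alpha = \min\!\left(1,\; \frac{g(\theta^\prime)\,q(\theta_{i-1} \mid \theta^\prime)}{g(\theta_{i-1})\,q(\theta^\prime \mid \theta_{i-1})}\right)$$

and otherwise set $\theta_i = \theta_{i-1}$. The normalizing constant cancels in the ratio, which is why $g$ suffices; and when $q$ is symmetric, the two $q$ terms cancel as well, the simplification mentioned below.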

The proposal distribution $q$ is not the actual posterior distribution $p$; the accept/reject criterion corrects for this mismatch. The $q$ distribution is usually a normal distribution with mean equal to the previous draw $\theta_{i-1}$ and a chosen standard deviation. The benefit of using a normal distribution is that the acceptance formula simplifies, because the normal distribution is symmetric. One still has to consider that the best $q$ should be similar to $p$. The lesson didn’t mention much about the best strategy for choosing $q$ (?), but from the simulation code it seems that whatever distribution we choose for $q$, after a large number of iterations the simulated $\theta^\prime$ values settle around a mean value with some variance.

There are also two different types of the Metropolis-Hastings (MH) algorithm: independent MH and random walk MH. Independent MH draws from a fixed normal/uniform distribution that does not depend on the previous draw. Random walk MH centers the proposal distribution at the previous draw $\theta_{i-1}$. I am assuming that the random walk MH converges faster? Not sure what the benefits of each of them are.

Just found that for independent MH, it is better to choose a proposal distribution that is very close to the actual posterior distribution. Though I am not sure how, because isn’t approximating the posterior through simulation the original purpose of using the MH method? It seems that with random walk MH, we can start from some normal distribution and over time the chain will reach the target distribution. I think random walk MH is more generally useful, although I guess it takes more time to run.
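For my own reference, here is a minimal random-walk MH sketch against a toy unnormalized target; the target density, step size, and other tuning choices are mine, not from the course:

```python
import numpy as np

# Unnormalized log-target: a Gamma(shape=3, rate=2) density up to a constant.
# In a real problem this would be log(likelihood * prior) at theta.
def log_g(theta):
    return 2.0 * np.log(theta) - 2.0 * theta if theta > 0 else -np.inf

rng = np.random.default_rng(0)
n_iter, step_sd = 20_000, 1.0   # step_sd is a tuning choice
theta = 1.0                     # arbitrary starting value
draws, n_accept = np.empty(n_iter), 0
for i in range(n_iter):
    # Symmetric normal proposal centered at the previous draw,
    # so the q terms cancel in the acceptance ratio.
    prop = rng.normal(theta, step_sd)
    if np.log(rng.uniform()) < log_g(prop) - log_g(theta):
        theta, n_accept = prop, n_accept + 1
    draws[i] = theta

print("acceptance rate:", n_accept / n_iter)
print("posterior mean estimate:", draws[5000:].mean(), "(true mean 1.5)")
```

Working on the log scale avoids numerical underflow, and proposals outside the support are rejected automatically because `log_g` returns $-\infty$ there.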

MH is able to converge to $p$ because it utilizes the information in $g$. $g$ is known and computable from the data and the prior distribution we have, while it is hard to get the full probability distribution $p$ because the denominator might be hard to integrate, so the posterior $p$ has no closed-form solution. Markov Chain Monte Carlo (MCMC) sampling methods like MH and Gibbs Sampling solve this problem by approximating the posterior distribution $p$ using the information from $g$ through the accept/reject criterion.


Testing Null Hypothesis for No Treatment Effects


Study Notes for Coursera Course: Causal Inference by Michael Sobel.

Suppose we have an observed treatment assignment $Z$ and a test statistic $t(Z, y)$. If we calculate $t(Z, y)$ for all other possible treatment assignments, we get a distribution of $t(Z, y)$. Under the Null Hypothesis, $H_0$, that there is no treatment effect, the observed value $t_{obs} = t(Z, y)$, calculated from the observed $Z$ and $y$, is just one draw from this distribution, and we calculate the probability that $t(Z, y) \geq t_{obs}$. In the example, the statistic $t(Z, y)$ used was $\bar{Y}_t - \bar{Y}_c$.

$\bar{Y}_t - \bar{Y}_c$ is the difference between the average outcome of the treated subjects and that of the untreated subjects. When we calculate this value for all possible assignments $Z$, we get a distribution for $\bar{Y}_t - \bar{Y}_c$. An important question is: what does this probability distribution represent?

Should I consider this distribution as the $H_0$ distribution, assuming there is no treatment effect? However, the observed $Y$ is associated with one particular $Z$, so what does it mean to re-randomize $Z$ and then calculate $\bar{Y}_t - \bar{Y}_c$?

Watching the video again: under $H_0$ the potential outcomes are the same for each subject, $y_i(0) = y_i(1)$, for all assignments in $\Omega$. In randomized experiments, all assignment vectors in $\Omega$ are equally likely. I am guessing that under $H_0$, if we take $\bar{Y}_t - \bar{Y}_c$ as the statistic, it should have some kind of distribution around 0, because $H_0$ assumes that there is no treatment effect. Then after we have a set of observations, we can use them to compute a statistic to do the hypothesis testing. The question now is: how do we know the distribution of the statistic under $H_0$?

If we think about hypothesis testing for the mean with known variance under a normal distribution, we construct $H_0:\mu=\mu_0$. Then after we collect sample data and compute $\bar{x}$, we find the probability $P(\bar{X} \geq \bar{x})$ under the sampling distribution of $\bar{X}$ with mean $\mu_0$ and the known, fixed variance. Here we need to construct something like $H_0: \bar{Y}_t - \bar{Y}_c=0$. The first question is: what is the statistic we are using here? In the normal case, we know the data are distributed normally, or we construct some statistic that is distributed in a certain way. We do need to assume some kind of distribution a priori though, no?

Just studying the Fisher and Bristol tea experiment example: for 8 cups of tea with 4 prepared milk-first and 4 prepared tea-first, there are $\binom{8}{4} = \frac{8!}{4!(8-4)!} = 70$ equally likely ways of labeling the cups, so each possible answer has probability $1/70$. But still, what is the $H_0$? Bristol knows that there are 4 milk-first and 4 tea-first cups, so she will always answer in a 4/4 format, only varying which cups get which label. The statistic we are interested in is then the number of correctly identified milk-first cups, $t$ (if she identifies $k$ of the milk-first cups, she labels $2k$ cups correctly in total, so this carries the same information as the number of correct answers). Now we want to know the probability of $t \geq t_{obs}$, where $t_{obs}$ is the observed value from the answer Bristol provides. The next question is still: how do we know the distribution of $t$? It is because we have $\binom{8}{4}$ equally likely guesses, and we also know the correct answer. From this we can calculate the probability of Bristol correctly identifying $0, 1, \ldots, 4$ of the milk-first cups. This is the distribution of $t$ under $H_0$ (pure guessing). Knowing the distribution of $t$, we can calculate $Pr(t \geq t_{obs})$ to see the probability of Bristol doing at least this well by just guessing. If that probability is very small, then it is very unlikely that Bristol is just guessing, and with a significant p-value we can reject $H_0$.
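As a check on this reasoning, here is a small sketch that enumerates all $\binom{8}{4}$ guesses and tabulates the null distribution of $t$ (cup indices are arbitrary labels):

```python
from itertools import combinations
from fractions import Fraction

# Tea-tasting null distribution: 8 cups, 4 truly milk-first.
# Under H0 (pure guessing) every choice of 4 cups labeled "milk-first"
# is equally likely; t = number of truly milk-first cups among the 4 chosen.
truth = set(range(4))                      # cups 0-3 are milk-first
counts = {}
for guess in combinations(range(8), 4):    # all C(8,4) = 70 possible answers
    t = len(truth & set(guess))
    counts[t] = counts.get(t, 0) + 1

total = sum(counts.values())
for t in sorted(counts):
    print(f"P(t = {t}) = {Fraction(counts[t], total)}")

# p-value for a perfect score: Pr(t >= 4)
print("Pr(t >= 4) =", Fraction(counts[4], total))
```

The resulting distribution is hypergeometric, and a perfect identification has probability $1/70 \approx 0.014$, small enough to reject pure guessing at the usual levels.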

However, that’s when we know the correct answer and can construct the probability distribution under $H_0$ directly. In our case, when we have equally likely treatment assignments, can we do the same thing? Say the treatments are randomly assigned and we calculate $Pr(t \geq t_{obs})$, where $t_{obs}$ is the number of treated subjects; this is not what we want, though. We want a statistic that relates to the effects, and in the example $\bar{Y}_t - \bar{Y}_c$ is selected. I guess that if we randomly shuffle the treatment assignment, we enumerate all the possible treatment assignments. I am not quite understanding this now. If we randomly shuffle the treatment, in reality, should I get different treatment effects?

Consider that we know all the possible treatment assignments and each of them is equally likely to happen; then each assignment has a test statistic $\bar{Y}_t - \bar{Y}_c$ calculated from the single observed outcome vector $y=(1,3,4,6)$ and treatment $Z=(0,0,1,1)$. I think the rationale behind this is that if we observed $y=(1,3,4,6)$, then under $H_0$ that there is no treatment effect, the way you assign treatments does not change $y$: the same outcomes are simply relabeled, so $\bar{Y}_t - \bar{Y}_c$ has a distribution centered around 0 over the assignments. I think this example is just to demonstrate a particular test statistic that we can use, but not a full demonstration of a valid hypothesis test for no treatment effects (?).
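To make this concrete, here is a sketch that enumerates the $\binom{4}{2}=6$ equally likely assignments for the toy example and computes the randomization distribution of $\bar{Y}_t - \bar{Y}_c$; the numbers are the ones from the example above:

```python
from itertools import combinations
import numpy as np

# Randomization distribution of ybar_t - ybar_c under the sharp null of
# no treatment effect: the observed outcomes y stay fixed while the
# treatment labels are shuffled over all equally likely assignments.
y = np.array([1, 3, 4, 6])
null_stats = []
for treated in combinations(range(4), 2):   # all C(4,2) = 6 assignments
    t_mask = np.zeros(4, dtype=bool)
    t_mask[list(treated)] = True
    null_stats.append(y[t_mask].mean() - y[~t_mask].mean())

t_obs = y[2:].mean() - y[:2].mean()         # observed Z = (0, 0, 1, 1)
p_value = np.mean([s >= t_obs for s in null_stats])
print("null distribution:", sorted(null_stats))
print("t_obs =", t_obs, " one-sided p-value =", p_value)
```

With only 6 possible assignments the smallest achievable p-value is $1/6$, so this toy example can illustrate the mechanics but can never reach conventional significance.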

In the later section of the video, a more meaningful $H_0$ is given: $H_0: \tau=1$ vs. $H_1: \tau>1$, where $y_i(1) - y_i(0)=\tau$ for every unit, and now everything starts to make sense. Under this $H_0$, the unobserved potential outcome of each unit can be imputed from the observed one by adding or subtracting $\tau$, so from the observed data we have $(y_1(0) = 1, y_2(0)=3, y_3(0)=3, y_4(0)=5)$ and $(y_1(1) = 2, y_2(1)=4, y_3(1)=4, y_4(1)=6)$ given treatment assignment $Z = (1,1,0,0)$. Gonna stop here for now and move on. It is still not very clear to me how to calculate the distribution of the test statistic under this $H_0$; I might come back to it later.
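My current understanding, sketched below: under a sharp null such as $H_0: y_i(1) - y_i(0) = \tau$, every unit’s missing potential outcome is pinned down by the observed one, so the test statistic can be recomputed for every possible assignment. This is my own reconstruction rather than code from the course, and the difference-in-means statistic is just one possible choice:

```python
from itertools import combinations
import numpy as np

# Randomization test of the sharp null H0: y_i(1) - y_i(0) = tau for all i.
# Under H0 both potential outcomes are known for every unit: the missing
# one is imputed from the observed one using tau.
tau = 1.0
z_obs = np.array([1, 1, 0, 0])
y_obs = np.array([2.0, 4.0, 3.0, 5.0])         # observed outcomes under z_obs

y1 = np.where(z_obs == 1, y_obs, y_obs + tau)  # impute y_i(1)
y0 = y1 - tau                                  # impute y_i(0)

def stat(z):
    # Difference in means of what would be observed under assignment z.
    return y1[z == 1].mean() - y0[z == 0].mean()

t_obs = stat(z_obs)
null_stats = []
for treated in combinations(range(4), 2):      # all equally likely assignments
    z = np.zeros(4, dtype=int)
    z[list(treated)] = 1
    null_stats.append(stat(z))

p_value = np.mean([t >= t_obs for t in null_stats])
print("null distribution:", sorted(null_stats))
print("t_obs =", t_obs, " Pr(t >= t_obs | H0) =", p_value)
```

With only four units the resulting p-value is essentially uninformative; the point is only the mechanics of imputing under the sharp null and enumerating assignments.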
