Priors for joint distribution set on marginal probabilities

Hi,

I have a model with a joint distribution over a number of discrete random variables, and I want to specify my priors over this distribution using the marginal probability of each of the variables separately. My prior beliefs about this model are easier to think of that way. For example, for simplicity imagine my joint probability is over two variables

p(r_d,r_m)

where r_d \in \{r_d^1, r_d^2\} and r_m \in \{r_m^1, r_m^2, r_m^3\}. Now I can easily imagine how to set priors on p(r_d) and p(r_m), but not on p(r_d, r_m).

At first, I thought, why not use a change of variables, roughly like the rough code below? But then I realized I can’t have such a transformation: it maps M \times N joint probabilities down to M + N marginal parameters, so it isn’t one-to-one.

data {
  int N_r;
  int N_m;

  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  simplex[N_r * N_m] joint_prob;
}
transformed parameters {
  vector[N_r - 1] alpha_d = logit(M_r * joint_prob);
  vector[N_m - 1] alpha_m = logit(M_m * joint_prob);
}
model {
  // Set some priors on alpha_d and alpha_m
  alpha_d ~ ...
  alpha_m ~ ...
}

Does anyone have any suggestions for how I can specify such a prior for a joint distribution using my knowledge of the marginal probabilities?

1 Like

If this is a prior distribution, then if you assume they’re independent you just have p(r_m) p(r_d) = p(r_d, r_m).

If you assume they’re not independent, then I would start with whatever that non-independence assumption looks like, and maybe with a small problem. Even with a 2x3 table, if you know the marginals that’s 5 equations for 6 unknowns (only 4 of them independent, since both marginals sum to one), so the dependence assumption only needs to supply an equation or two.
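If you do go the independent route, the joint is just the outer product of the two marginal simplexes. Here’s a rough sketch of what that might look like in Stan (the names prob_d and prob_m and the Dirichlet priors are just my placeholders):

data {
  int N_r;
  int N_m;
}
parameters {
  simplex[N_r] prob_d; // marginal p(r_d)
  simplex[N_m] prob_m; // marginal p(r_m)
}
transformed parameters {
  // Under independence, p(r_d, r_m) = p(r_d) p(r_m): outer product, flattened column-major
  simplex[N_r * N_m] joint_prob = to_vector(prob_d * prob_m');
}
model {
  // Placeholder priors expressing the marginal beliefs directly
  prob_d ~ dirichlet(rep_vector(1, N_r));
  prob_m ~ dirichlet(rep_vector(1, N_m));
}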

2 Likes

I hadn’t given much thought to modeling the dependence directly. I’ll have to mull over that.

I was also thinking of setting priors for p(r_m) and p(r_d | r_m), where the latter would have the same priors for all configurations of r_m.

data {
  int N_r;
  int N_m;

  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  vector[N_r - 1] alpha_d[N_m];
  vector[N_m - 1] alpha_m;
}

transformed parameters {
  simplex[N_r * N_m] joint_prob;

  {
    int joint_prob_pos = 1;

     for (m_index in 1:N_m) {
       int joint_prob_end = joint_prob_pos + N_m - 1;
       joint_prob[joint_prob_pos:joint_prob_end] = inv_logit(alpha_m) .* inv_logit(alpha_d[m_index]);
       joint_prob_pos = joint_prob_end + 1;
     }
  }
}
model {
  // Set some priors on alpha_d and alpha_m
  alpha_d ~ multi_normal(rep_vector(0, N_r - 1), diag_matrix(rep_vector(1, N_r - 1)));
  alpha_m ~ normal(0, 1);
}
1 Like

Yeah, if p(r_d | r_m) is the same for all r_m, then that’s the same as independent priors (in this case the prior on r_d is not a function of r_m, so you could just drop the conditioning and write p(r_d)).

Forgive me if I’m off-base here…
I find the code in the OP difficult to understand. Maybe I’m just being dense, but for example M_r and joint_prob appear to have the wrong dimensionality for M_r * joint_prob to be valid.

What I’m trying to discern (without immediate success) is whether @karimn’s question is really about the simplex constraint of joint_prob. That is, how do we specify a distribution on a simplex with a given set of marginal probabilities? If this is really the question, then I think the answer is subtler than what @bbbales2 is suggesting. In this case, the real question might be how to write down a Dirichlet distribution with a given set of marginal probabilities.

Am I on the right track here, or just adding noise?

1 Like

Sorry, yes, I switched the rows and columns in the matrices.

You are correct when you say that my question is about how to “specify a distribution on a simplex with a given set of marginal probabilities”. I also kind of messed things up by using the logit function when what I really meant to use is the inverse of a softmax function (which is a whole other mess). I intentionally want to use a softmax function with unbounded parameters (the \alpha) rather than a Dirichlet distribution, since that makes it easier to make this hierarchical. That approach is a dead end anyway, but I wanted to share how I started thinking about this.

Ben is right that this would be much simpler if the joint distribution were over independent variables. However, I can’t make that assumption, even if my priors are the same for p(r_d | r_m) for all r_m. That’s why in my second bit of code (again, please forgive my use of inv_logit there; I’ve corrected it below) I have \alpha_d as an array of vectors (one for each value of r_m) that we will fit to the data. This is the only workable way I can think of doing this. Please let me know if you think this approach is wrong or if there is an easier way to model this.

data {
  int N_r;
  int N_m;

  // Left over from the earlier change-of-variables attempt; unused in this model.
  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  vector[N_r - 1] alpha_d[N_m]; // unbounded parameters for p(r_d | r_m), one vector per r_m
  vector[N_m - 1] alpha_m;      // unbounded parameters for p(r_m)
}

transformed parameters {
  simplex[N_r * N_m] joint_prob;

  {
    int joint_prob_pos = 1;
    // Append a zero for the reference category so the softmax gives a full simplex
    vector[N_m] prob_m = softmax(append_row(0, alpha_m));

    for (m_index in 1:N_m) {
      int joint_prob_end = joint_prob_pos + N_r - 1;
      // p(r_d, r_m = m) = p(r_m = m) * p(r_d | r_m = m)
      joint_prob[joint_prob_pos:joint_prob_end]
        = prob_m[m_index] * softmax(append_row(0, alpha_d[m_index]));
      joint_prob_pos = joint_prob_end + 1;
    }
  }
}
model {
  // Set some priors on alpha_d and alpha_m (same prior on p(r_d | r_m) for every r_m)
  alpha_d ~ multi_normal(rep_vector(0, N_r - 1), diag_matrix(rep_vector(1, N_r - 1)));
  alpha_m ~ normal(0, 1);
}

I don’t have any brilliant suggestions, but you might find this helpful (as well as a few replies up-thread from there). (The title of the thread seems off-topic, but this is way downthread).

1 Like

Yeah, this is a good example of why it’s often best to reason about the prior as a probability distribution over the entire model configuration space.

Even ignoring the simplex constraint (concepts like independence are well-defined only on product spaces), a multivariate distribution is not completely specified by its marginals – in general an infinite number of joint distributions \pi(x_{1}, \ldots, x_{K}) will share the marginals \pi(x_{k}). In other words, you need to specify more information to fix the joint distribution.
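For a toy example of this non-uniqueness, take two binary variables with uniform marginals: the joint distributions

\begin{pmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}

both have row and column marginals equal to (1/2, 1/2), but the first corresponds to independent variables and the second to perfectly correlated ones.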

On the other hand, when working with a family of convenient multivariate probability density functions, there’s no guarantee that any element of that family will have the desired marginals, and one may have to compromise to build a practical prior model.

Introducing a constraint also complicates matters. In this case not every set of marginals will be compatible with the constraint – for example, if x_{1} + x_{2} = 1 then marginals that concentrate around x_{1} = 0.8 and x_{2} = 0.8 are going to be impossible to satisfy – in which case trying to construct a compatible joint distribution is doomed from the beginning.

That said the Dirichlet does have some convenient properties. Given the parameters \{\alpha_{1}, \ldots, \alpha_{K} \} the marginal for any single component is given by a Beta density function with parameters \text{Beta}(\alpha_{k}, \sum_{k' \ne k} \alpha_{k'} ). If your knowledge about the marginals can approximately fix those two parameters for each k – for example with mean and variance conditions – then you end up with a system of equations for all of the \alpha_{k}. For general marginals that system probably won’t have a solution but you can use it as a starting point to iteratively weaken the constraints on the \alpha_{k} until the marginals are consistent with your domain expertise and with each other.
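To spell out one way the moment matching can work: write \alpha_{0} = \sum_{k} \alpha_{k}, so that the Dirichlet marginal for component k is \text{Beta}(\alpha_{k}, \alpha_{0} - \alpha_{k}) with

\mathbb{E}[x_{k}] = \frac{\alpha_{k}}{\alpha_{0}}, \qquad \mathrm{Var}[x_{k}] = \frac{\mathbb{E}[x_{k}]\,(1 - \mathbb{E}[x_{k}])}{\alpha_{0} + 1}.

Eliciting a target mean m_{k} for each component then gives \alpha_{k} = m_{k} \alpha_{0}, and any one target variance v_{k} gives a candidate concentration \alpha_{0} = m_{k}(1 - m_{k}) / v_{k} - 1. The means have to sum to one and the implied \alpha_{0} values have to (roughly) agree; when they don’t, that’s exactly the kind of inconsistency you would iteratively relax.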

3 Likes

I’m having trouble thinking about this. Can you give more details about it?

Is it that we have measurements of these discrete things? Like we have N output data points and for each one we have an r_d and an r_m. Or is this something we’re going to try to integrate out?

And the marginal distribution thing – is it that we’re going to compute these marginal distributions from some regression? Or do we already know the marginal distributions from somewhere and it’s a thing we want to just assume?

Michael and Ben, sorry for the delayed response.

First, let me try to answer Ben’s question. I’m trying to model what Judea Pearl in his book Causality describes as bounded estimation of treatment effects (chapter 8). Basically, each observation has a vector of latent types. In my simple model here, there are two type variables r_d and r_m, and each is a discrete random variable. For example, in my model,

r_d \in \{\textrm{offered-on-assignment}, \textrm{never-offered}\} \\ r_m \in \{\textrm{never-migrate}, \textrm{complier-migrate}, \textrm{always-migrate}\}.

There are three binary observed variables in this model (Z, D, M), where Z is a random assignment in an experiment. From the observed data, I usually cannot determine exactly which types each observation belongs to. For example, if an observation is assigned to control, Z = 0, and I observe D = 0, I can never determine what r_d is for that observation; but if I observe D = 1, I know for sure that r_d = \textrm{offered-on-assignment}. And similarly for M. I want to calculate the posterior distribution of the joint p(r_d, r_m) so I can calculate the posterior distribution of the bounds of causal estimands such as E(M_{d=1}) (here the subscript indicates an intervention such that everyone is forced to D = 1). Another complication is that these types are likely correlated (endogenous). So, for each observation, I have a mixture of possible type combinations, and that’s all I have to update the joint probability. There are going to be infinitely many joint distributions that conform to the data, but I want to know what the posterior over joint distributions looks like and how it bounds my estimands.

Michael, maybe I don’t understand correctly, but it sounds like the marginal Beta distribution that we get from the Dirichlet distribution you mentioned, for some x_k, would be for the marginal probability of any particular pair of types (r_d, r_m). So the Beta distribution you mentioned would be on the probability, for example, that (r_d = \textrm{never-offered}, r_m = \textrm{never-migrate}). What my prior knowledge actually bears on is the relative proportions of types within each type variable: I have a general sense that p(r_d = \textrm{never-offered}) should be small and that p(r_m = \textrm{never-migrate}) should be higher than for the other r_m types.

What I have been doing is tweaking the joint distribution parameters and then using prior prediction to look at the marginal probabilities p(r_d) and p(r_m) to see if they conform with what I want to model. I just wasn’t sure that my raising and lowering of the joint-probability parameters wasn’t messing things up in ways I don’t see when I only look at the prior-predicted marginal probabilities. What I thought would be better was modeling the priors p(r_d) and p(r_m | r_d) separately. This would allow the model to vary the probabilities of r_m conditional on r_d, but when I’m setting the prior I just use the same prior for all p(r_m | r_d). I.e., I set the prior probabilities p(r_m = \textrm{never-migrate} | r_d = \textrm{offered-on-assignment}) and p(r_m = \textrm{never-migrate} | r_d = \textrm{never-offered}) equally high.
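Concretely, the prior check I have in mind is just a generated quantities block like this (a rough sketch; it assumes joint_prob is laid out as in my code above, with r_d varying fastest within each r_m block):

generated quantities {
  // Implied marginals of the joint, for prior predictive checks of p(r_d) and p(r_m)
  vector[N_r] marginal_d = rep_vector(0, N_r);
  vector[N_m] marginal_m = rep_vector(0, N_m);

  for (m_index in 1:N_m) {
    for (d_index in 1:N_r) {
      real p = joint_prob[(m_index - 1) * N_r + d_index];
      marginal_d[d_index] += p;
      marginal_m[m_index] += p;
    }
  }
}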

I should keep my mouth shut, but this is the problem with most of the computer-science approaches to causal inference. Without assumptions one cannot narrow infinitely many possibilities down to meaningful insights using finite observations. Any meaningful causal inference requires a sufficiently rigid model of the actual data generating process.

No, the Beta densities are for the marginal probabilities of each element relative to all of the other elements, hence the sum over all of the other prior parameters for those elements. Domain expertise on the marginal proportion of element k can be used to inform an approximate Beta density, which then informs \alpha_k and \sum_{k' \ne k} \alpha_{k'}. Do this for enough elements and in theory you can get a system of linear equations that you can solve for all of the \alpha_k, but if the Beta densities aren’t consistent with each other then that system won’t actually have a solution.

Michael,

Thank you for your view on this. I’m glad you spoke up. It gives me pause and makes me want to think more carefully about the assumptions I need.

With a structural graphical model like this, there are a lot of hard assumptions already being made on the possible “types” of individuals that can exist and how variables respond to their inputs. But ultimately, the endogeneity of some of these variables leaves us with no credible assumptions about how their latent components are correlated. Hence, this bounded estimation approach. However, I’m starting to feel that I need to rethink some of this.

1 Like