Priors for joint distribution set on marginal probabilities

Hi,

I have a model with a joint distribution over a number of discrete random variables, and I want to specify my priors over this distribution using the marginal probability of each of the variables separately. My prior beliefs about this model are easier to think about that way. For example, for simplicity imagine my joint probability is over two variables

p(r_d,r_m)

where r_d \in \{r^1_d, r^2_d\} and r_m \in \{r_m^1,r_m^2, r_m^3\}. Now I can easily imagine how to set priors on p(r_d) and p(r_m) but not p(r_d,r_m).

At first, I thought why not use a change of variables, kind of like the rough code below, but then I realized I can’t have such a transformation from an M \times N-dimensional space to an (M + N)-dimensional one.

data {
  int N_r; // number of categories for r_d
  int N_m; // number of categories for r_m

  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  simplex[N_r * N_m] joint_prob;
}
transformed parameters {
  vector[N_r - 1] alpha_d = logit(M_r * joint_prob);
  vector[N_m - 1] alpha_m = logit(M_m * joint_prob);
}
model {
  // Set some priors on alpha_d and alpha_m
  alpha_d ~ ...
  alpha_m ~ ...
}

Does anyone have any suggestions how I can specify such a prior for a joint distribution using my knowledge of the marginal probabilities?

If this is a prior distribution, then if you assume they’re independent you just have p(r_m) p(r_d) = p(r_d, r_m).
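For concreteness, here’s a rough sketch of what the independent version could look like in Stan (the names and the uniform Dirichlet priors are just placeholders for whatever marginal priors you actually have in mind):

data {
  int N_r; // number of categories for r_d
  int N_m; // number of categories for r_m
}
parameters {
  simplex[N_r] prob_d; // marginal p(r_d)
  simplex[N_m] prob_m; // marginal p(r_m)
}
transformed parameters {
  // outer product p(r_d) p(r_m), flattened column-major so r_d varies fastest
  simplex[N_r * N_m] joint_prob = to_vector(prob_d * prob_m');
}
model {
  // replace with whatever marginal priors you actually believe
  prob_d ~ dirichlet(rep_vector(1, N_r));
  prob_m ~ dirichlet(rep_vector(1, N_m));
}

The joint then inherits exactly the marginal priors you set, but only because of the independence assumption.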

If you assume they’re not independent, then I would start with whatever that non-independence assumption looks like, and maybe a small problem. Even with 2x3, if you know the marginals that’s 5 equations for 6 unknowns, but only 4 of them are independent (each set of marginals sums to one), so the assumption only needs to provide a couple more equations.


I hadn’t given much thought to modeling the dependence directly. I’ll have to mull over that.

I was also thinking of setting priors for p(r_m) and p(r_d | r_m), where the latter would have the same priors for all configurations of r_m.

data {
  int N_r;
  int N_m;

  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  vector[N_r - 1] alpha_d[N_m];
  vector[N_m - 1] alpha_m;
}

transformed parameters {
  simplex[N_r * N_m] joint_prob;

  {
    int joint_prob_pos = 1;

     for (m_index in 1:N_m) {
       int joint_prob_end = joint_prob_pos + N_m - 1;
       joint_prob[joint_prob_pos:joint_prob_end] = inv_logit(alpha_m) .* inv_logit(alpha_d[m_index]);
       joint_prob_pos = joint_prob_end + 1;
     }
  }
}
model {
  // Set some priors on alpha_d and alpha_m
  alpha_d ~ multi_normal(rep_vector(0, N_r - 1), diag_matrix(rep_vector(1, N_r - 1)));
  alpha_m ~ normal(0, 1);
}

Yeah, if p(r_d | r_m) is the same for all r_m, then that’s the same as independent priors (in this case the prior on r_d is not a function of r_m, so you could just drop the conditioning and write p(r_d)).
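In other words, if p(r_d | r_m) = p(r_d) for every r_m, then

p(r_d, r_m) = p(r_d | r_m) \, p(r_m) = p(r_d) \, p(r_m),

which is just the independent factorization from before.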

Forgive me if I’m off-base here…
I find the code in the OP difficult to understand. Maybe I’m just being dense, but for example M_r and joint_prob appear to have the wrong dimensionality for M_r * joint_prob to be valid.

What I’m trying to discern (without immediate success) is whether @karimn’s question is really about the simplex constraint of joint_prob. That is, how do we specify a distribution on a simplex with a given set of marginal probabilities? If this is really the question then I think the answer is subtler than what @bbbales2 is suggesting. In this case, the real question might be how to write down a Dirichlet distribution with a given set of marginal probabilities.

Am I on the right track here, or just adding noise?


Sorry, yes, I switched the rows and columns in the matrices.

You are correct when you say that my question is about how to “specify a distribution on a simplex with a given set of marginal probabilities”. I also kind of messed things up by using the logit function when really what I meant to use is the inverse of a softmax function (which is a whole other mess). I intentionally want to use a softmax function with unbounded parameters (the \alpha) rather than a Dirichlet distribution (it makes it easier to make this hierarchical). This approach is a dead end anyway, but I wanted to share how I started to think about this.

Ben is right that this would be much simpler if the joint distribution were over independent variables. However, I can’t make that assumption even if my priors are the same over p(r_d|r_m) for all r_m. That’s why in my second bit of code (again, please forgive my use of inv_logit there; I’ve corrected it below) I have \alpha_d as an array of vectors (one for each value of r_m) that we will fit to the data. This is the only workable way I can think of doing this. Please let me know if you think this approach is wrong or if there is an easier way to model it.

data {
  int N_r; // number of categories for r_d
  int N_m; // number of categories for r_m

  matrix[N_r * N_m, N_r - 1] M_r;
  matrix[N_r * N_m, N_m - 1] M_m;
}
parameters {
  vector[N_r - 1] alpha_d[N_m]; // softmax parameters for p(r_d | r_m = m), one vector per value of r_m
  vector[N_m - 1] alpha_m;      // softmax parameters for p(r_m)
}

transformed parameters {
  simplex[N_r * N_m] joint_prob;

  {
    int joint_prob_pos = 1;
    // append a fixed 0 so the softmax over all N_m categories is identified
    vector[N_m] prob_m = softmax(append_row(0, alpha_m));

    for (m_index in 1:N_m) {
      int joint_prob_end = joint_prob_pos + N_r - 1;
      // p(r_d, r_m = m) = p(r_m = m) * p(r_d | r_m = m)
      joint_prob[joint_prob_pos:joint_prob_end]
        = prob_m[m_index] * softmax(append_row(0, alpha_d[m_index]));
      joint_prob_pos = joint_prob_end + 1;
    }
  }
}
model {
  // Set some priors on alpha_d and alpha_m
  alpha_d ~ multi_normal(rep_vector(0, N_r - 1), diag_matrix(rep_vector(1, N_r - 1)));
  alpha_m ~ normal(0, 1);
}

I don’t have any brilliant suggestions, but you might find this helpful (as well as a few replies up-thread from there). (The title of the thread seems off-topic, but this is way downthread).


Yeah, this is a good example of why it’s often best to reason about the prior as a probability distribution over the entire model configuration space.

Even ignoring the simplex constraint (concepts like independence are well-defined only on product spaces), a multivariate distribution is not completely specified by its marginals – in general an infinite number of joint distributions \pi(x_{1}, \ldots, x_{K}) will share the marginals \pi(x_{k}). In other words you need to specify more information to fix the joint distribution.
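As a toy illustration, take two binary variables whose marginals are both (1/2, 1/2). The independent joint

\begin{pmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{pmatrix}

and the perfectly correlated joint

\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}

both have exactly those marginals, so the marginals alone cannot tell them apart.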

On the other hand, when working with a family of convenient multivariate probability density functions, there’s no guarantee that any element of that family will have the desired marginals, and one may have to compromise to build a practical prior model.

Introducing a constraint also complicates matters. In this case not every set of marginals will be compatible with the constraint – for example if x_{1} + x_{2} = 1 then marginals with means \mathbb{E}[x_{1}] = 0.8 and \mathbb{E}[x_{2}] = 0.8 are impossible to satisfy, since the means would have to sum to one – in which case trying to construct a compatible joint distribution is doomed from the beginning.

That said the Dirichlet does have some convenient properties. Given the parameters \{\alpha_{1}, \ldots, \alpha_{K} \} the marginal for any single component is given by a Beta density function with parameters \text{Beta}(\alpha_{k}, \sum_{k' \ne k} \alpha_{k'} ). If your knowledge about the marginals can approximately fix those two parameters for each k – for example with mean and variance conditions – then you end up with a system of equations for all of the \alpha_{k}. For general marginals that system probably won’t have a solution but you can use it as a starting point to iteratively weaken the constraints on the \alpha_{k} until the marginals are consistent with your domain expertise and with each other.
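To make that moment matching concrete (this is just one way to set it up): writing \alpha_{0} = \sum_{k} \alpha_{k}, the \text{Beta}(\alpha_{k}, \alpha_{0} - \alpha_{k}) marginal has

\mathbb{E}[x_{k}] = \frac{\alpha_{k}}{\alpha_{0}}, \qquad \mathrm{Var}[x_{k}] = \frac{\mathbb{E}[x_{k}] (1 - \mathbb{E}[x_{k}])}{\alpha_{0} + 1},

so a target mean \mu_{k} and variance \sigma_{k}^{2} for component k implies

\alpha_{0} = \frac{\mu_{k} (1 - \mu_{k})}{\sigma_{k}^{2}} - 1, \qquad \alpha_{k} = \mu_{k} \, \alpha_{0}.

If the \mu_{k} sum to one and every component implies roughly the same \alpha_{0}, you get a consistent Dirichlet; if not, that is where the iterative weakening of the conditions comes in.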
