Prior on matrix of ordered types in mixture model

Hi,

I’m implementing a 2-component mixture model of negative binomials for some RNA count data. In the model below, I’d like to get help with changing my priors on the parameters in the matrix positive_ordered[2] theta_arr[G,C] to be g-indexed instead of (g,c)-indexed. For example, another model I have tried with this dataset is a (1-component) GLM negative binomial; and its parameter matrix<lower=0>[G, C] theta; receives the following prior:

model {
    for (g in 1:G)
        theta[g] ~ normal(0, 1);
    ...
}

And that is what I would like to replicate in the below model.

Here is the full model:

data {
    // dimensions
    int<lower=1> N;  // N observations (RNA measurements: one per gene x patient)
    int<lower=1> G;  // G genes
    int<lower=1> S;  // S patients
    int<lower=0> C;  // C disease types - mutually exclusive. Each row here should sum to 1

    // data
    int<lower=1, upper=G> gene[N];    // gene id for each observation
    int<lower=1, upper=S> sample[N];  // sample id for each observation
    vector<lower=0, upper=1>[C] x[N]; // map each obs to each class (0:'- or ?', 1:'+')
    int<lower=0> y[N];                // count for each observation
}
transformed data {
    int sample_y[S, G]; // array (size SxG) of ints
    vector[C] sample_x[S]; // array (size S) of vectors[C]
    for (n in 1:N) {
        sample_y[sample[n], gene[n]] = y[n];
        sample_x[sample[n]] = x[n,];
    }
}
parameters {
    vector<lower=0>[G] gene_phi_1;  // overdispersion parameter per gene, component 1
    vector<lower=0>[G] gene_phi_2;  // overdispersion parameter per gene, component 2
    real<lower=0, upper=1> lambda; // mixing weight of the two components

    positive_ordered[2] theta_arr[G,C]; // array (size GxC) of ordered types containing means of component 1 and component 2
    vector[G] log_gene_base_1;     // constant intercept expression level for each gene, irrespective of cell type, component 1
    vector[G] log_gene_base_2;     // constant intercept expression level for each gene, irrespective of cell type, component 2
}

transformed parameters {
  matrix<lower=0>[G, C] theta_1; // loading factors for each gene, for each cell type. component 1
  matrix<lower=0>[G, C] theta_2; // loading factors for each gene, for each cell type. component 2
  for (g in 1:G) {
    for (c in 1:C) {
      theta_1[g,c] = theta_arr[g, c, 1];
      theta_2[g,c] = theta_arr[g, c, 2];
    }
  }
}

model {
  for (g in 1:G) {
    for (c in 1:C) {
      theta_arr[g, c] ~ normal(0, 1); // This block is where I need help
    }
  }

  lambda ~ beta(2,2);
  gene_phi_1 ~ normal(0, 1);
  gene_phi_2 ~ normal(0, 1);
  log_gene_base_1 ~ normal(0, 0.5);
  log_gene_base_2 ~ normal(0.5, 0.3);

  for (s in 1:S) {
      vector[G] log_expected_rate_1;
      vector[G] log_expected_rate_2;

      log_expected_rate_1 = log_gene_base_1 + log(theta_1*sample_x[s]);
      log_expected_rate_2 = log_gene_base_2 + log(theta_2*sample_x[s]);

      target += log_mix(lambda,
        neg_binomial_2_log_lpmf(sample_y[s] | log_expected_rate_1, gene_phi_1),
        neg_binomial_2_log_lpmf(sample_y[s] | log_expected_rate_2, gene_phi_2));

  }
}
generated quantities {
  int<lower=0> y_rep[N];
  real log_expected_rate_1[N];
  real log_expected_rate_2[N];

  for (n in 1:N) {
      log_expected_rate_1[n] = log_gene_base_1[gene[n]] + log(theta_1[gene[n], ]*x[n]);
      log_expected_rate_2[n] = log_gene_base_2[gene[n]] + log(theta_2[gene[n], ]*x[n]);

      if(bernoulli_rng(lambda))
        y_rep[n] = neg_binomial_2_log_rng(log_expected_rate_1[n], gene_phi_1[gene[n]]);
      else
        y_rep[n] = neg_binomial_2_log_rng(log_expected_rate_2[n], gene_phi_2[gene[n]]);
  }
}

One thing I’ve tried is this:

model {
  for (g in 1:G) {
      theta_1[g] ~ normal(0, 1);
      theta_2[g] ~ normal(0,1);
     ...
  }

which results in this output when the model is built:

DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    theta_1[g] ~ normal(...)
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    theta_2[g] ~ normal(...)

I’m not sure if I need to heed this warning / what I should do in response?

Any suggestions/solutions other than or in addition to this one are welcome, thanks!

You can ignore it because indexing is actually a linear function but the Stan parser cannot tell that.

2 Likes