Testing a categorical model

I want to fit this model:

data {
    int<lower=0> N;
    vector[N] u;
    real g[N];
    int h[N];
    vector[N] y;
}
parameters {
    real alpha;
    real phi;
    real upsilon;
    real kappa;
    real<lower=0> sigma1;
    real<lower=0> sigma2;
}
model {
    sigma2 ~ lognormal(0,1);
    phi ~ normal(0,1);
    upsilon ~ normal(0,1);
    kappa ~ normal(0,1);
    alpha ~ normal(0,1);
    for (n in 1:N) { g[n] ~ normal(0,1); }
    for (n in 1:N) { h[n] ~ categorical_logit(phi * g[n] + upsilon * u[n]); }
    for (n in 1:N) { y[n] ~ normal(alpha[h[n]] + kappa * u[n], sigma2); }
}

Everybody recommends generating data first so I can check that the model recovers the assumed parameters, so I swapped the data and parameters blocks to get

data {
    int<lower=0> N;
    real alpha;
    real phi;
    real upsilon;
    real kappa;
    real<lower=0> sigma1;
    real<lower=0> sigma2;
}
parameters {
    vector[N] u;
    real g[N];
    int h[N];
    vector[N] y;
}
model {
    sigma2 ~ lognormal(0,1);
    phi ~ normal(0,1);
    upsilon ~ normal(0,1);
    kappa ~ normal(0,1);
    alpha ~ normal(0,1);
    for (n in 1:N) { g[n] ~ normal(0,1); }
    for (n in 1:N) { h[n] ~ categorical_logit(phi * g[n] + upsilon * u[n]); }
    for (n in 1:N) { y[n] ~ normal(alpha[h[n]] + kappa * u[n], sigma2); }
}

but Stan isn’t willing to generate data for me to test with.

 line 13, column 4 to column 13:
   -------------------------------------------------
    11:      vector[N] u;
    12:      real g[N];
    13:      int h[N];
             ^
    14:      vector[N] y;
    15:  }
   -------------------------------------------------

(Transformed) Parameters cannot be integers.

I tried generating the data with a linear continuous version instead

data {
    int<lower=0> N;
    real alpha;
    real phi;
    real upsilon;
    real kappa;
    real<lower=0> sigma1;
    real<lower=0> sigma2;
}
parameters {
    vector[N] u;
    real g[N];
    real h[N];
    vector[N] y;
}
model {
    sigma1 ~ lognormal(0,1);
    sigma2 ~ lognormal(0,1);
    phi ~ normal(0,1);
    upsilon ~ normal(0,1);
    kappa ~ normal(0,1);
    alpha ~ normal(0,1);
    for (n in 1:N) { g[n] ~ normal(0,1); }
    for (n in 1:N) { h[n] ~ normal(phi * g[n] + upsilon * u[n], sigma1); }
    for (n in 1:N) { y[n] ~ normal(alpha * h[n] + kappa * u[n], sigma2); }
}

which did generate some data. I then converted the h values to ints by cutting the continuous values into bins, and Stan was able to sample from the integer model, but it produced wrong answers:

> summarize_stan(indir("analysis.comcsv"))
13×4 DataFrame
 Row │ variable       mean             std           eltype
     │ Symbol         Float64          Float64       DataType
─────┼────────────────────────────────────────────────────────
   1 │ lp__           -2535.86          1.72457      Float64
   2 │ accept_stat__      0.918997      0.0958284    Float64
   3 │ stepsize__         0.0273539     3.12406e-17  Float64
   4 │ treedepth__        6.288         1.15948      Int64
   5 │ n_leapfrog__     103.196        37.6453       Int64
   6 │ divergent__        0.0           0.0          Int64
   7 │ energy__        2538.92          2.52855      Float64
   8 │ alpha             -0.0209184     0.0342002    Float64
   9 │ phi               -0.0857657     0.0631577    Float64
  10 │ upsilon            0.000929568   8.91259e-5   Float64
  11 │ kappa             -0.99995       0.000101321  Float64
  12 │ sigma1             2.05748       0.0459591    Float64
  13 │ sigma2             2.25148       0.047969     Float64

> trueparams
Dict{String, Real} with 7 entries:
  "sigma1"  => 1.0
  "upsilon" => 2.0
  "N"       => 1000
  "kappa"   => 1.0
  "phi"     => 0.0
  "alpha"   => -1.0
  "sigma2"  => 2.0

This whole thing was a lot of work, especially to get a wrong answer. My questions are:

  • How can I generate test data for the original model? Can Stan do it?
  • Is the original model ok? If I write code in some other language to generate data like that model, will Stan be able to sample from it correctly?

Sadly, it’s not particularly straightforward. The way to do it in general when there are integer parameters is to use the generated quantities block. But that essentially requires rewriting the code and isn’t any easier than just generating the data in R or Python or something.
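
For what it’s worth, here is roughly what that looks like for your model. It’s only a sketch, not a drop-in translation: it assumes a known number of categories K, assumes alpha, phi, and upsilon are vectors with one entry per category (your program declares them as scalars but then indexes alpha and passes a scalar to categorical_logit), and draws u and g from standard normals. Since nothing is being estimated, you’d run it with the fixed_param algorithm.

data {
    int<lower=0> N;
    int<lower=2> K;          // number of categories (assumed)
    vector[K] alpha;         // one intercept per category, to match alpha[h[n]]
    vector[K] phi;
    vector[K] upsilon;
    real kappa;
    real<lower=0> sigma2;
}
generated quantities {
    vector[N] u;
    vector[N] g;
    int h[N];
    vector[N] y;
    for (n in 1:N) {
        u[n] = normal_rng(0, 1);    // covariate, assumed standard normal here
        g[n] = normal_rng(0, 1);
        h[n] = categorical_logit_rng(phi * g[n] + upsilon * u[n]);
        y[n] = normal_rng(alpha[h[n]] + kappa * u[n], sigma2);
    }
}

You then read the simulated u, g, h, and y back out of the draws and feed them to the original program as data.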

You definitely do not want to do continuous sampling and then cut. It’ll break Stan’s sampler, which assumes continuous differentiability.

I don’t understand how the categorical_logit could work, given that the argument seems to be a scalar rather than a vector. And the Stan compiler agrees with me:

> cmdstan_model('temp.stan')
...
    23:      for (n in 1:N) { h[n] ~ categorical_logit(phi * g[n] + upsilon * u[n]); }
                              ^
...
Ill-typed arguments to '~' statement. No distribution 'categorical_logit' was found with the correct signature
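
For it to type-check, the argument has to be a vector of logits, one per category. One way to get there, just as a sketch and assuming the coefficients are meant to vary by category, is to promote phi and upsilon to K-vectors:

parameters {
    vector[K] phi;        // one coefficient per category, K assumed known
    vector[K] upsilon;
    // ... remaining parameters as before ...
}
model {
    for (n in 1:N) {
        // phi * g[n] and upsilon * u[n] are now K-vectors, so their sum is
        // a valid vector of logits
        h[n] ~ categorical_logit(phi * g[n] + upsilon * u[n]);
    }
}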

The likelihood is also problematic in that terms like phi * g[n] are not identified, because they’re both parameters: if you divide phi by a constant and multiply all the g[n] by the same constant, the result is unchanged. You can identify the model by giving g and phi priors, as you have done, but that leads to banana-shaped posteriors which can be very difficult to sample. The sum phi * g[n] + upsilon * u[n] is similarly problematic, because you can add a constant to one term and subtract it from the other and get the same total.

When you do sort out the categorical_logit, there’s an identifiability issue there too, in that you can add or subtract the same constant from all of the logits and get the same distribution.
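
The usual fix, again just a sketch with a hypothetical phi_free, is to pin one category’s coefficient to zero and estimate only the remaining K - 1, which removes the additive freedom:

parameters {
    vector[K - 1] phi_free;
}
transformed parameters {
    // first category's coefficient fixed at 0 for identifiability
    vector[K] phi = append_row(0, phi_free);
}

You’d do the same for upsilon, and for alpha if it ends up varying by category.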