Simplex Regression

Hello.

I’m trying to predict a probability distribution from a categorical predictor. It seems like my model is not only clunky, but also wrong, as I am getting a few error messages that I can’t seem to fix. Hope someone can help. Here’s the error message:

SAMPLING FOR MODEL 'stan_code' NOW (CHAIN 1).
Unrecoverable error evaluating the log probability at the initial value.
Exception: dirichlet_lpmf: prior sample sizes has dimension = 6, expecting dimension = 48989; a function was called with arguments of different scalar, array, vector, or matrix types, and they were not consistently sized;  all arguments must be scalars or multidimensional values of the same shape.  (in 'model9ddc3e2dbeb8_stan_code' at line 36)

[1] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                                                                                                                                                                                                                   
[2] "  Exception: dirichlet_lpmf: prior sample sizes has dimension = 6, expecting dimension = 48989; a function was called with arguments of different scalar, array, vector, or matrix types, and they were not consistently sized;  all arguments must be scalars or multidimensional values of the same shape.  (in 'model9ddc3e2dbeb8_stan_code' at line 36)"
error occurred during calling the sampler; sampling not done

And here’s the relevant part of my R code:

design_matrix = model.matrix(~ Typ, data = dat)
distributions = t(simplify2array(dat$distribution_vectors))
data_list = list(Y = distributions,
            X = design_matrix,
            N = nrow(dat),
            ncat = 6,
            input_dim = 6)
# fit model
fit <- stan(file = 'stan_code.stan', data = data_list, 
            iter = 1000, chains = 1)

And finally the stan script:

data { 
  int<lower=1> N;  // total number of observations 
  int<lower=2> ncat;  // number of categories
  int<lower=2> input_dim; // number of predictor levels
  matrix[N,input_dim] X; // predictor design matrix
  matrix[N,ncat] Y;  // response variable (simplex?)
}
transformed data{
  matrix[ncat,N] Y_transposed;
  for (category in 1:ncat){
    for (n in 1:N){
      Y_transposed[category,n] = Y[n,category];
  }
  }
}

parameters {
  matrix[ncat,input_dim] weights; // coefficients
  vector[ncat] y_transposed;
}

model {
  // priors
  for (out_category in 1:ncat) {
     for (in_category in 1:input_dim) {
      weights[out_category,in_category] ~ normal(0,5);
  }
  }
  // likelihood 
  for (n in 1:N) {
    vector[ncat] logits;
    for (m in 1:ncat){
      logits[m] = X[n] * transpose(weights[m]);
  }
    // logits = X[n] * weights; // can't multiply matrix with vector
    Y[,n] ~ dirichlet(softmax(logits)); 
  }
}

Y has N rows and ncat columns, but in the sampling statement, Y[,n] ~ dirichlet(softmax(logits)), it looks like the column index n ranges from 1 to N rather than 1 to ncat. I’m pretty sure that’s a mistake.

Yes, you’re right, thanks. It was supposed to be Y_transposed[,n] instead.

Now I have new problem, which from googling it a bit is probably due to underflow on the softmax statement. How can I solve that problem? And could someone explain what exactly is going wrong? And what is being initialized between -2 and 2, I’m setting the priors explicitly to gaussians, no?

Rejecting initial value:
  Log probability evaluates to log(0), i.e. negative infinity.
  Stan can't start sampling from this initial value.

Initialization between (-2, 2) failed after 100 attempts.
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done

See section 8.2 of the Stan reference manual, especially the subsection “Random Initial Values.”

Well, that certainly was not helpful. As I said, I have initialized all the parameters (which is just the weights) to gaussians, so the error message is about as useful as that part of the stan manual, which I had already looked at.

I found the error by now, even though I don’t really understand it. It occurs when there is even just a single element in any of the datapoints’ probability distributions that has a probability of 0. I suppose this somehow leads to underflow. I would be interested in an explanation, but in any case, this runs now and is solved.

That’s not the kind of initialization that the error message was mentioning. It’s referring to setting the initial values of parameters to actual real numbers, not just setting the prior distributions of the parameters. See the init parameter in the stan() function of RStan.

Yes, but the problem isn’t actually with initialization, it’s with data. if you have a data vector that you’re giving a likelihood as dirichlet, and you have data with zero in any entry, you’ll get that there’s no density, because the density of the dirichlet is proportional to a product of the x values raised to some power, so as soon as one of the x values is 0… the whole product is zero, and the log is -Inf.

Then, it doesn’t matter what Stan does for initial values, it still can’t get a valid posterior density.

I found this out the hard way…

jjramsey,
yes, sorry. I actually realized that before I even posted (and then went on to investigate), but somehow ended up not changing the answer. But then again, it did not help.

dlakelan,
yes, thanks for pointing that out. Now that you said it, it’s obvious to see in the density formula. That does make me wonder, though, how Latent Dirichlet Analysis results in sparse distributions (that’s actually the distributions I’m looking at, over topics from LDA). I guess there’s just a cutoff and it’s not a result of the actually sampling process.

I marked this as the correct solution. @kuchenrolle, see the section on support in the box to the right of:

The part where it says the support requires x[i] > 0 is the relevant part. The Dirichlet isn’t defined for simplexes with zero values in some dimensions.

If you have the data, you want to just fit a Dirichlet-multinomial on that. If you’re reducing the data first to a simplex and trying to fit that, you’ll run into these kinds of problems because you’re not establishing a coherent generative model for the data.

Well, I just smoothed the data, s.t. each topic has some minimal probability, which solved the problem. How would I have told Stan to use a multinomial dirichlet, though?

You need the original count data which it sounds like you’re smoothing. So rather than taking the data and smoothing into a simplex, you’d use a Dirichlet-multinomial — that is, draw a parameter theta[i] from the Dirichlet for each observation i and then model the count data y[i] as Dirichlet(theta[i]).

Hello,

can you please share the final working model you achieved?

I am trying to do a similar model.

Thanks a lot.