Import dataset with NA column as an integer in rstan

[edit: removed boilerplate and escaped second model]

I have a dataset 53*24 where to columns are NAs. I want to import them to stan as int, because later on I have to fit a multinomial likelihood. How can I solve this? I read on the User manual I have to split it into integers arrays but I haven’t really understood how… If someone could explain it to me would be really appreciated! Cheers :)

model {
  vector[N] mu = alpha + beta * x;
  y ~ normal(mu, sigma); 

instead of

  vector[N] mu = alpha+beta*x;

Two columns, maybe?

The conversion of NA values isn’t a Stan-specific question. It’s going to have to be done in R as Stan doesn’t accept NA inputs.

Given that NA means undefined, how do you want to convert to integers? If there’s an entire column of NA values, I’d suggest just dropping them.

If there really are NA values in multinomial observations, what you need to do is cut down and rescale the simplex to the categories that do exist. Marginally, that will do the right thing for values that you do observe because any subsequence of a multinomial observation is multinomial over the rescaled simplex. That is if

(a, b, c) ~ multinomial(a + b + c, theta),


(a, c) ~ multinomial(a + c, [theta[1], theta[3]]' / (theta[1] + theta[3]))

Why the normal models? Are those just in the boilerplate for a query that you didn’t edit out?

Hi Bob! Thank you for the fast reply. I explain this better: I am fitting 24 years of data from a survey, each one contains a vector of 53 observation of numbers of something (that’s why they are fitted with multinomial likelihoods and that’s why they must be integers). Some years the survey wan not done, so that’s why the whole column is NA, but I cannot drop it because something unknown happened that I would like to estimate. I don’t know if this changes your answer or not, I am not sure of what you mean by “rescale the simplex to the categories that do exist”. Thank you very much!

Yes sorry,k the normal models are there because I didn’t edit the boilerplate!

To keep it simple, suppose you have count data in 4 categories, y_t \in \mathbb{N}^4. And now suppose for time t you somehow estimate a simplex \theta_t \in \Delta^3 (i.e., \theta \in [0, \infty)^4 and \textrm{sum}(\theta) = 1) and the complete data model is

y_t \sim \textrm{categorical}(\theta_t).

Now suppose that y_t (for some t) is missing column 2, e.g., y_t = [17 \ \texttt{NA} \ 12 \ 2]. What you can do is project the simplex \theta_t down to 3 dimensions,

\phi_t = \frac{[\theta_{t, 1} \ \theta_{t,3} \ \theta_{t,4}]^{\top}}{\theta_{t,1} + \theta_{t,3} + \theta_{t,4}},

and use

[17 \ 12 \ 2] \sim \textrm{categorical}(\phi_t)

We know that this is the right marginal distribution for only observing y_{t, 1}, y_{t, 3}, y_{t, 4}.

Then if you want to impute the actual values, you have multiple options. Presumably we don’t know T_t, the total number of observations at time t. If we did, then we could just sample the unknowns (with just one, it’d be deterministic), or we could draw a Poisson with some uncertainty.

You can use this no matter how \theta_t was derived—by a simple simplex parameter or by a multi-logic regression. It still works the same way.

1 Like