Import dataset with NA column as an integer in rstan

[edit: removed boilerplate and escaped second model]

I have a 53×24 dataset where two columns are NAs. I want to import them into Stan as int, because later on I have to fit a multinomial likelihood. How can I solve this? I read in the User's Guide that I have to split the data into integer arrays, but I haven't really understood how… If someone could explain it to me, it would be really appreciated! Cheers :)

model {
  vector[N] mu = alpha + beta * x;
  y ~ normal(mu, sigma); 
} 

instead of

model{
  vector[N] mu = alpha+beta*x;
  y~normal(mu,sigma); 
}


The conversion of NA values isn’t a Stan-specific question. It’s going to have to be done in R as Stan doesn’t accept NA inputs.

Given that NA means undefined, how do you want to convert to integers? If there’s an entire column of NA values, I’d suggest just dropping them.
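
On the R side, dropping means subsetting to the surveyed years before passing the data, so Stan only ever sees complete integer arrays. Here is a minimal sketch of what the Stan side might look like, with all names (K, T_obs, y, theta) illustrative rather than from the original post:

data {
  int<lower=1> K;                    // categories per year
  int<lower=1> T_obs;                // number of years actually surveyed
  array[T_obs, K] int<lower=0> y;    // counts, with the NA columns dropped in R
}
parameters {
  array[T_obs] simplex[K] theta;     // one simplex per observed year
}
model {
  for (t in 1:T_obs)
    y[t] ~ multinomial(theta[t]);
}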

If there really are NA values in multinomial observations, what you need to do is cut the simplex down and rescale it to the categories that do exist. Marginally, that will do the right thing for the values you do observe, because any subsequence of a multinomial observation is multinomial over the rescaled simplex. That is, if

(a, b, c) ~ multinomial(a + b + c, theta),

then

(a, c) ~ multinomial(a + c, [theta[1], theta[3]]' / (theta[1] + theta[3]))
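
One way to see why, in case the reasoning step helps: lumping categories preserves the multinomial, so

(a + c, b) ~ multinomial(a + b + c, [theta[1] + theta[3], theta[2]]),

and conditioning a multinomial on a sub-total renormalizes the probabilities, which gives the rescaled two-category form above.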

Why the normal models? Are those just boilerplate from the query template that you didn't edit out?

Hi Bob! Thank you for the fast reply. Let me explain better: I am fitting 24 years of data from a survey; each year contains a vector of 53 observed counts of something (that's why they are fitted with multinomial likelihoods and why they must be integers). Some years the survey was not done, which is why the whole column is NA, but I cannot drop it, because something unknown happened in those years that I would like to estimate. I don't know if this changes your answer or not; I am not sure what you mean by "rescale the simplex to the categories that do exist". Thank you very much!

Yes, sorry, the normal models are there because I didn't edit out the boilerplate!

To keep it simple, suppose you have count data in 4 categories, y_t \in \mathbb{N}^4. Now suppose that for time t you somehow estimate a simplex \theta_t \in \Delta^3 (i.e., \theta_t \in [0, \infty)^4 and \textrm{sum}(\theta_t) = 1) and the complete-data model is

y_t \sim \textrm{multinomial}(\theta_t).

Now suppose that y_t (for some t) is missing category 2, e.g., y_t = [17 \ \texttt{NA} \ 12 \ 2]. What you can do is project the simplex \theta_t down to 3 dimensions,

\phi_t = \frac{[\theta_{t, 1} \ \theta_{t,3} \ \theta_{t,4}]^{\top}}{\theta_{t,1} + \theta_{t,3} + \theta_{t,4}},

and use

[17 \ 12 \ 2] \sim \textrm{multinomial}(\phi_t).

We know that this is the right marginal distribution for only observing y_{t, 1}, y_{t, 3}, y_{t, 4}.
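
In Stan, that projection is only a few lines. Here's a sketch under the assumptions above (4 categories, category 2 missing; the names y_obs and theta are illustrative):

data {
  array[3] int<lower=0> y_obs;   // observed counts, e.g., {17, 12, 2}
}
parameters {
  simplex[4] theta;              // simplex over all 4 categories
}
model {
  // project theta onto the observed categories 1, 3, 4 and renormalize
  vector[3] phi = [theta[1], theta[3], theta[4]]'
                  / (theta[1] + theta[3] + theta[4]);
  y_obs ~ multinomial(phi);
}

In a full model you would index theta by t and apply the projection only in the years with missing entries.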

Then if you want to impute the actual missing values, you have multiple options. Presumably we don't know T_t, the total number of observations at time t. If we did, we could just sample the unknowns (with just one missing value, it'd be deterministic); if we don't, we could model the total as a Poisson to capture the uncertainty.
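
As a sketch of those two options in generated quantities (T_t, lambda, y_obs, and theta are illustrative names, not from the original post):

generated quantities {
  // known total: the single missing count is deterministic
  int y_miss = T_t - sum(y_obs);
  // unknown total modeled as T_t ~ poisson(lambda): by Poisson splitting,
  // the missing category's count is itself Poisson with rate lambda * theta[2]
  int y_miss_draw = poisson_rng(lambda * theta[2]);
}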

You can use this rescaling no matter how \theta_t was derived, whether as a simple simplex parameter or via a multi-logit regression. It works the same way either way.
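
For instance, a multi-logit version might look like the following sketch (x, beta, and D are assumed names for year-level covariates, coefficients, and predictor count):

data {
  int<lower=1> T;              // years
  int<lower=1> K;              // categories
  int<lower=1> D;              // predictors per year
  array[T] vector[D] x;        // year-level covariates
}
parameters {
  matrix[K, D] beta;           // multi-logit coefficients
}
transformed parameters {
  array[T] simplex[K] theta;
  for (t in 1:T)
    theta[t] = softmax(beta * x[t]);   // still a simplex, so the rescaling applies unchanged
}

(In practice you would pin one row of beta or give it a proper prior for identifiability, but that does not change how the rescaling works.)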
