Ways to use data in CmdStan

Lets say I have the following lin_reg.stan file.

data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}

I know I can have a data.R file that contains the following data

N <- 10
y <- c(3,3,4,5,4,7,8,7,7,8)
x <- c(1,2,1,2,1,3,3,4,3,4)

Using ./lin_reg sample data file=lin_reg.R then view the results with stansummary output.csv

This runs fine with no problem, but this seems like a toy example. In fact, most of the data I would like to use in stan is tidy. In this format there is no N<-10, just column names and their data for each observation. Is it possible to send a csv or dataframe in that is tidy to stan?

For example a tidy csv file would like like:

y   x
7  3
8  3
7  4
8  3
7  4
8  3

the file format that cmd stan accepts is the Rdump file format.

the underlying issue is what are you trying to compute?

in the above example, you’ve got alpha and beta and 10 data points. your long-format data structure is two arrays of length 20. your wide format data structure is two arrays of length 10. if you add another covariate, e.g. parameters intercept alpha (which in long format will have data column values of 1), and coefficients beta1, and beta2, then a long-format data input would be two arrays or length 30 and a wide format input would be 3 arrays of length 10. in order to compute:

y ~ normal(alpha + beta1 * x1 + beta2 * x2 , sigma);

you’re going to be computing with vectors whose length is the number of observations - i.e., wide-format data.

if your data happens to be in a long format, then you can convert it back to wide format before sending it to Stan or using the transformed data block to do the transformation. array variables must be declared with a specified size. again, you can declare things in the transformed data block, which allows you to compute the size from other inputs.

cheers,
Mitzi

Sorry, I don’t think I was very clear.

As an example, how do I go from the following csv to estimates for the linear regression parameters.

y,x
4.2,0.5
11.5,0.5
7.3,0.5
5.8,0.5
6.4,0.5
10,0.5
11.2,0.5
11.2,0.5
5.2,0.5
7,0.5
16.5,1
16.5,1
15.2,1
17.3,1
22.5,1
17.3,1
13.6,1
14.5,1
18.8,1
15.5,1
23.6,2
18.5,2
33.9,2
25.5,2
26.4,2
32.5,2
26.7,2
21.5,2
23.3,2
29.5,2
15.2,0.5
21.5,0.5
17.6,0.5
9.7,0.5
14.5,0.5
10,0.5
8.2,0.5
9.4,0.5
16.5,0.5
9.7,0.5
19.7,1
23.3,1
23.6,1
26.4,1
20,1
25.2,1
25.8,1
21.2,1
14.5,1
27.3,1
25.5,2
26.4,2
22.4,2
24.5,2
24.8,2
30.9,2
26.4,2
27.3,2
29.4,2
23,2

Does rstanarm not meet you needs? You’re basically asking for a different interface than we have which is fine but… it’s not what we have…

Cmdstan takes basically the same input as Rstan (what @mitzimorris was saying) . You can dump a data list meant for Rstan into a Cmdstan friendly format with stan_rdump (check the manual on it).

@Krzysztof_Sakrejda is right about Rstanarm – that’s the interface that is friendly to tidy data.

If you’re using your own models, no real way around data = list(N = nrow(df), y = df$y, x = df$x) that I know of.

2 Likes

rstanarm will do most of what I need it to, but the reason I am interested in stan is to have more flexibility in modeling. When I read through the stan users manual I see that there is a unified way to model anything from regression to time-series. As far as I am aware, rstanarm just gives you the front end of lm and glm while giving a stan back-end. It does not give you full flexibility to build models such as time-series or censored-data models, but maybe I am wrong.

Thanks @bbbales2 for letting me know that there is no way around data = list(N = nrow(df), y = df$y, x = df$x). This is good to know. I think that if I have tidy data to start with it would not be hard to provide a data list. It does make me wonder why stan requires N when N=nrow(y)?

Stan requires all container types to have a specified size in order to prevent run-time index-out-of-bounds errors.

RStanArm does a lot of analysis of your data to find the sizes and pass them in to its super-robust versions of common models.

however, if you’re doing custom models, you’re also going to have to do custom data-munging. a very large amount of the effort that goes into (good) data analysis is careful data munging - you just gotta learn to love the processing process.

that said, we do want to make I/O easier, so feedback always welcome.

1 Like

@mitzimorris Thanks, I think that rstan, not rstanarm, may be where I spend a lot of my time as it seems to provide more model flexibility while allowing me to use tidyverse tools for fast data manipulation.

As long as the number of column-variables I am working with is reasonable I think that something like data = list(N = nrow(df), y = df$y, x = df$x) does not cause too much of a hiccup in programmer-time.

if you are using tidyverse, you can go from long to wide quite easily.
Stan’s vectorized operations also allow you to write the following:

data {
  int<lower=1> N;
  int<lower=1> K;                 // num covariates
  int<lower=0> y[N];              // count outcomes
  matrix[N, K] x;                  // design matrix
}
parameters {
  real beta0;                // intercept
  vector[K] betas;       // covariates
}
model {
  y ~ poisson_log(beta0 + x * betas);

  beta0 ~ normal(0.0, 2.5);
  betas ~ normal(0.0, 2.5);
}

input data list is 4 items, no matter how many column variables.

2 Likes

Thanks for this great example. As someone who has made a habit out of making data long format. I that examples of this nature should be shown more often in documentation.

Usually our documentation is trying to explain just one thing at a time, so everything else is kept simple and fixed. But that leaves it up to the user to put everything together, which can be challenging, especially in a new language with strange data structures.

The case studies get into a lot more of this, but still probably not as much as they should.