Problem Simulating Data with Generated Quantities (Dimension mismatch in assignment; type = real; right-hand side type = real[ ])

I am trying to start working directly with Stan code by simulating some data from priors using the generated quantities block. I’m struggling to match vectors and reals

# Fake starting data frame
n <- 20
data <- data.frame(
  SOGF = sample(c(0,1), n, replace = TRUE),
  STGF = sample(c(0,1), n, replace = TRUE),
  PK = rnorm(n, mean = 50, sd = 2),
  RTN = rnorm(n, mean = 0, sd = 1))
X <- model.matrix(data = data, 
                  RTN ~ SOGF + STGF + PK)
#Stan model
model_code = "
data {
  int n; // number of observations
  int k; // number of predictors
  matrix[n,k] X; // predictor matrix
  vector[n] Y; // outcome vector
  int<lower = 0, upper = 1> run_estimation; // evaluate the likelihood?
}
parameters {
  real alpha; // intercept
  vector[k] beta; // coefficients for predictors
  real<lower=0> sigma; // error scale
}
model {
  alpha ~ normal(0, 10);
  beta[1] ~ normal(0, 10); // prior for SOGF
  beta[2] ~ normal(0, 10); // prior for STGF
  beta[3] ~ normal(0, 10); // prior for PK
  sigma ~ std_normal();
  // conditionally run likelihood
  if(run_estimation==1){
    Y ~ normal(alpha + X * beta, sigma);
  }
}
generated quantities { 
 vector[n] Y_sim;
for(i in 1:n) {
    Y_sim[i] = normal_rng(alpha + X * beta, sigma); 
  }
} 
"
# Sample from model
sim_out <- stan(model_code = model_code, 
                data = list(n = n, 
                            k = ncol(X), 
                            X = X, 
                            Y = data$RTN, 
                            run_estimation = 0),
                    iter = 1000, 
                    chains = 1)

This gives me the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Dimension mismatch in assignment; variable name = Y_sim, type = real; right-hand side type = real[ ].
Illegal statement beginning with non-void expression parsed as
  Y_sim[i]
Not a legal assignment, sampling, or function statement.  Note that
  * Assignment statements only allow variables (with optional indexes) on the left;
  * Sampling statements allow arbitrary value-denoting expressions on the left.
  * Functions used as statements must be declared to have void returns

 error in 'modelb69803c68a26d_6e8a02c9f71e8fbd8d0f5be02af0006c' at line 28, column 4
  -------------------------------------------------
    26:  vector[n] Y_sim;
    27: for(i in 1:n) {
    28:     Y_sim[i] = normal_rng(alpha + X * beta, sigma); 
           ^
    29:   }
  -------------------------------------------------

PARSER EXPECTED: "}"

If I comment out the generated quantities block I can sample from the fake data.

What am I doing incorrectly in that generated quantities block?

The error here is:

variable name = Y_sim, type = real; right-hand side type = real[ ]

Because the result of alpha + X * beta is a vector of length n, the vectorised version of normal_rng is being called, and attempting to return an array of reals (i.e., real[]), except you’re trying to assign the result to a single value (Y_sim[i]).

You can instead just allocate the vectorised return directly:

generated quantities { 
 real Y_sim[n] = normal_rng(alpha + X * beta, sigma); 
} 

Also, stan has a built-in distribution for linear regression which will sample more efficiently (and is GPU accelerated if you’re using cmdstanr).

You would specify it like so:

  if(run_estimation==1){
    Y ~ normal_id_glm(X, alpha, beta, sigma);
  }

Thank you once again, @andrjohns !!