Infer mean and std of four separate groups, need help indexing data vector

I’d like to infer the mean and standard deviation for four separate groups; these groups are partitioned by demand and experiment cell (eg test-high, test-low, ctrl-high, ctrl-low.)

In the likelihood declaration in my model block, I believe that there should be a single likelihood declaration rather than four separate declarations. To accomplish this change, I would need to consolidate every vector in the data section into a single vector then use indices to keep track of which data relate to which parameters. (This is the part that I need help with.)

How should i accomplish this effect?

normal_model = """
data { 
  int<lower=1> n_test_high;
  int<lower=1> n_test_low;
  int<lower=1> n_ctrl_high;
  int<lower=1> n_ctrl_low;  

  vector<lower=0>[n_test_high] X_test_high;  
  vector<lower=0>[n_test_low] X_test_low;    
  vector<lower=0>[n_ctrl_high] X_ctrl_high;  
  vector<lower=0>[n_ctrl_low] X_ctrl_low;      
}
parameters {
  real mu_test_high;
  real<lower=0> sigma_test_high;
  
  real mu_test_low;
  real<lower=0> sigma_test_low;

  real mu_ctrl_high;
  real<lower=0> sigma_ctrl_high;    

  real mu_ctrl_low;
  real<lower=0> sigma_ctrl_low;      
} 

transformed_parameters {
  real test_ratio = mu_test_high / mu_test_low;
  real ctrl_ratio = mu_ctrl_high / mu_ctrl_low;
  real delta_ratio = (mu_test_high / mu_test_low) - (mu_ctrl_high / mu_ctrl_low);  
} 


model {
  // Priors
  mu_test_high ~ normal(0, sqrt(1000));
  sigma_test_high ~inv_gamma(3,2); 

  mu_test_low ~ normal(0, sqrt(1000));
  sigma_test_low ~inv_gamma(3,2);   

  mu_ctrl_high ~ normal(0, sqrt(1000));
  sigma_ctrl_high ~inv_gamma(3,2);   
  
  mu_ctrl_low ~ normal(0, sqrt(1000));
  sigma_ctrl_low ~inv_gamma(3,2);     

  // Likelihood
  X_test_high ~ normal(mu_test_high, sigma_test_high);
  X_test_low ~ normal(mu_test_low, sigma_test_low);
  X_ctrl_high ~ normal(mu_ctrl_high, sigma_ctrl_high);
  X_test_low ~ normal(mu_ctrl_low, sigma_ctrl_low);    
}
"""
1 Like

So the general idea is that you want to go from:

Group 1 Group 2 Group 3 Group 4
y_1 y_3 y_5 y_7
y_2 y_4 y_6 y_8

To:

y Group
y_1 1
y_2 1
y_3 2
y_4 2
y_5 3
y_6 3
y_7 4
y_8 4

With this data structure, your Stan model then becomes:

data { 
  int<lower=1> N;
  int<lower=1> N_groups;
  vector<lower=0>[N] y;
  array[N] int group;
}
parameters {
  vector[N_groups] mu;
  vector<lower=0>[N_groups] sigma; 
} 

model {
  mu ~ normal(0, sqrt(1000));
  sigma ~ inv_gamma(3,2);

  y ~ normal(mu[group], sigma[group]);
}
2 Likes

Hi, thank you for the response! If I understand the below correctly:

  1. y and group are vectors/arrays of the same length and capture the corresponding value and group assignment of each given y_i, respectively.
  2. The statement mu ~normal(0, sqrt(1000)) applies the same sampling statement to each element in the vector mu but does not use pooling.

Am I correct in 1 & 2 above?

Cheers!

Yep, correct on both counts

1 Like

Encountered the below error, but the swap was simple int group[N];

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

  error in '' at line 6, column 2
  -------------------------------------------------
     4:   int<lower=1> N_groups;
     5:   vector<lower=0>[N] y;
     6:   array[N] int group;
         ^
     7: }
  -------------------------------------------------

PARSER EXPECTED: <one of the following:
  a variable declaration, beginning with type,
      (int, real, vector, row_vector, matrix, unit_vector,
       simplex, ordered, positive_ordered,
       corr_matrix, cov_matrix,
       cholesky_corr, cholesky_cov
  or '}' to close variable declarations>

Ah, I’m guessing you’re using rstan then? The new array syntax isn’t yet available in the version of stan in rstan

These two versions are equivalent, right?

I often end up in the first situation with multiple sampling statements involving the data: e.g., using multiple data sets to inform the likelihood. I’ve played around with “stacking” the data like this but, in my case, it often seems like more work and also somewhat less clear to read.

I don’t mean that this model is less clear either way. I just want to double check my understanding for my own purposes.

Yep the two versions represent identical models. The second (“stacked”) version will (generally) be more efficient since a single vectorised likelihood can be applied to all observations

1 Like

Thanks! That makes sense. I think my cases often involve different densities (if that’s the word) for different data sets so I’d need to break up that loop anyway. But I can see how if it was uniform enough among the groups, stacking would be an efficiency win.

1 Like

PyStan-- but it probably functions much the same as Rstan :)