Infer mean and std of four separate groups, need help indexing data vector

Jacob_Moore · June 28, 2022, 7:01pm

I’d like to infer the mean and standard deviation for four separate groups; these groups are partitioned by demand and experiment cell (eg test-high, test-low, ctrl-high, ctrl-low.)

In the likelihood declaration in my model block, I believe that there should be a single likelihood declaration rather than four separate declarations. To accomplish this change, I would need to consolidate every vector in the data section into a single vector then use indices to keep track of which data relate to which parameters. (This is the part that I need help with.)

How should i accomplish this effect?

normal_model = """
data { 
  int<lower=1> n_test_high;
  int<lower=1> n_test_low;
  int<lower=1> n_ctrl_high;
  int<lower=1> n_ctrl_low;  

  vector<lower=0>[n_test_high] X_test_high;  
  vector<lower=0>[n_test_low] X_test_low;    
  vector<lower=0>[n_ctrl_high] X_ctrl_high;  
  vector<lower=0>[n_ctrl_low] X_ctrl_low;      
}
parameters {
  real mu_test_high;
  real<lower=0> sigma_test_high;
  
  real mu_test_low;
  real<lower=0> sigma_test_low;

  real mu_ctrl_high;
  real<lower=0> sigma_ctrl_high;    

  real mu_ctrl_low;
  real<lower=0> sigma_ctrl_low;      
} 

transformed_parameters {
  real test_ratio = mu_test_high / mu_test_low;
  real ctrl_ratio = mu_ctrl_high / mu_ctrl_low;
  real delta_ratio = (mu_test_high / mu_test_low) - (mu_ctrl_high / mu_ctrl_low);  
} 


model {
  // Priors
  mu_test_high ~ normal(0, sqrt(1000));
  sigma_test_high ~inv_gamma(3,2); 

  mu_test_low ~ normal(0, sqrt(1000));
  sigma_test_low ~inv_gamma(3,2);   

  mu_ctrl_high ~ normal(0, sqrt(1000));
  sigma_ctrl_high ~inv_gamma(3,2);   
  
  mu_ctrl_low ~ normal(0, sqrt(1000));
  sigma_ctrl_low ~inv_gamma(3,2);     

  // Likelihood
  X_test_high ~ normal(mu_test_high, sigma_test_high);
  X_test_low ~ normal(mu_test_low, sigma_test_low);
  X_ctrl_high ~ normal(mu_ctrl_high, sigma_ctrl_high);
  X_test_low ~ normal(mu_ctrl_low, sigma_ctrl_low);    
}
"""

andrjohns · June 29, 2022, 6:13am

So the general idea is that you want to go from:

Group 1	Group 2	Group 3	Group 4
y_1	y_3	y_5	y_7
y_2	y_4	y_6	y_8

To:

y	Group
y_1	1
y_2	1
y_3	2
y_4	2
y_5	3
y_6	3
y_7	4
y_8	4

With this data structure, your Stan model then becomes:

data { 
  int<lower=1> N;
  int<lower=1> N_groups;
  vector<lower=0>[N] y;
  array[N] int group;
}
parameters {
  vector[N_groups] mu;
  vector<lower=0>[N_groups] sigma; 
} 

model {
  mu ~ normal(0, sqrt(1000));
  sigma ~ inv_gamma(3,2);

  y ~ normal(mu[group], sigma[group]);
}

Jacob_Moore · June 29, 2022, 2:41pm

Hi, thank you for the response! If I understand the below correctly:

y and group are vectors/arrays of the same length and capture the corresponding value and group assignment of each given y_i, respectively.
The statement mu ~normal(0, sqrt(1000)) applies the same sampling statement to each element in the vector mu but does not use pooling.

Am I correct in 1 & 2 above?

Cheers!

andrjohns · June 29, 2022, 5:08pm

Yep, correct on both counts

Jacob_Moore · June 29, 2022, 6:03pm

Encountered the below error, but the swap was simple int group[N];

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

  error in '' at line 6, column 2
  -------------------------------------------------
     4:   int<lower=1> N_groups;
     5:   vector<lower=0>[N] y;
     6:   array[N] int group;
         ^
     7: }
  -------------------------------------------------

PARSER EXPECTED: <one of the following:
  a variable declaration, beginning with type,
      (int, real, vector, row_vector, matrix, unit_vector,
       simplex, ordered, positive_ordered,
       corr_matrix, cov_matrix,
       cholesky_corr, cholesky_cov
  or '}' to close variable declarations>

andrjohns · June 29, 2022, 6:13pm

Ah, I’m guessing you’re using rstan then? The new array syntax isn’t yet available in the version of stan in rstan

adamConnerSax · June 29, 2022, 6:29pm

These two versions are equivalent, right?

I often end up in the first situation with multiple sampling statements involving the data: e.g., using multiple data sets to inform the likelihood. I’ve played around with “stacking” the data like this but, in my case, it often seems like more work and also somewhat less clear to read.

I don’t mean that this model is less clear either way. I just want to double check my understanding for my own purposes.

andrjohns · June 29, 2022, 6:35pm

Yep the two versions represent identical models. The second (“stacked”) version will (generally) be more efficient since a single vectorised likelihood can be applied to all observations

adamConnerSax · June 29, 2022, 6:37pm

Thanks! That makes sense. I think my cases often involve different densities (if that’s the word) for different data sets so I’d need to break up that loop anyway. But I can see how if it was uniform enough among the groups, stacking would be an efficiency win.

Jacob_Moore · June 29, 2022, 9:11pm

PyStan-- but it probably functions much the same as Rstan :)

Topic		Replies	Views
How to vectorize when dealing with vectors of different lengths while partial pooling? Modeling specification , hierarchical-model	9	931	August 11, 2020
Vectorizing sampling of arrays of vectors Modeling	6	690	January 22, 2020
Mean by groups Modeling	14	2510	October 31, 2019
Simple nested normal dist model help Modeling	10	1186	September 6, 2018
Calculate variance of parameter by Group Modeling	5	965	January 13, 2021

Infer mean and std of four separate groups, need help indexing data vector

Related topics