Aggregating intermediate quantities by index variable

dholcomb · September 30, 2020, 10:36pm

data {
  // outcome information
  int<lower = 0> n_out;  // number of outcome observations
  int<lower = 0> y[n_out];  // outcome observations
  vector[n_out] log_off;  // offset in count regression; pass as log-transformed
  
  // exposure information
  int<lower = 0> n_expose;  // number of exposure locations
  vector[n_expose] weight_expose;  // exposure weight at each exposure location
  
  int<lower = 0> n_target;  // number of target locations to calculate exposure for
  int<lower = 0> n_zip;  // number of zipcodes; in unstratified case should equal n_out
  int<lower = 1, upper = n_zip> zip_target[n_target];  // target zip to aggregate to
  int<lower = 0> zip_target_obs[n_zip];  // number of targets to aggregate in each zipcode
  vector[n_target] weight_target;  // population weight at each target location for aggregation
  
  int<lower = 0> n_w;  // number of non-zero elements of distance matrix
  int<lower = 0> n_u;  // number row-start indicators in sparse distmat
  vector[n_w] w;  // non-zero elements of distmat
  int<lower = 0> v[n_w];  // column indices of non-zero distmat elements
  int<lower = 0> u[n_u];  // indicator of where in w a given row's non-zero values start

}


parameters {
  real b0;  // intercept
  real b1;  // coefficient on exposure metric
  real log_ar;  // spatial range parameter (log-scale for sampling)
}

transformed parameters{
  real<lower = 0> ar = exp(log_ar);  // spatial range parameter
}

model {
  vector[n_w] w_new = (-3 * w) / ar;  // scale distmat by spatial range (just positives vector)
  
  // sum of distance-decay exposures at each target loc
  vector[n_target] x_target = csr_matrix_times_vector(n_target, n_expose,
                                                      w_new, v, u,
                                                      weight_expose);
  
  vector[n_target] x_target_w = x_target .* weight_target; // weight by population fraction
  
  // aggregate pop-weighted blocks by zip code
  vector[n_zip] x;
  for (i in 1:n_zip) {
    int J = zip_target_obs[i];  // number of blocks in zipcode i
    vector[J] x_temp = x_target_w[zip_target[i]];
    x[i] = sum(x_temp);
  }
  
  vector[n_zip] log_x = log10(x);
  vector[n_zip] x_std = (log_x - mean(log_x)) / sd(log_x);
  
  // poisson likelihood
  y ~ poisson_log(log_off + b0 + beta * x_std);
  
  // priors
  b0 ~ normal(0, 5);
  b1 ~ normal(0, 2.5);
  log_ar ~ normal(8, 1.5) // can't center at 0, needs to be positive,
                          // also shouldn't be above log(30000) = 10.3
                          // because that's the maximum range we looked for farms
  
}

I have disease counts for US 5-digit zip codes but need to calculate an exposure variable from point locations. I would like to calculate exposures at finer spatial resolution (US Census Block centroids) and take the population-weighted mean to obtain a value of the exposure variable for each zip code. The plan is to use the population-weighted mean zip code exposure as a predictor in a Poisson model of the disease outcome. I want to allow the exposure to decay exponentially with distance, with the rate controlled by the spatial range parameter \alpha_{r}, which needs to be estimated. As such, I need to calculate the exposure for each block centroid and aggregate those values to zip codes with each model iteration. I’m sure there are numerous issues to consider with this approach, but at the moment I’m stuck on the fairly basic issue of summarizing a vector by groups. Essentially I am trying to do the stan equivalent of a dplyr::group_by() %>% dplyr::summarise(). I have attempted to do this using index variables and a for loop:

vector[n_target] x_target_w = x_target .* weight_target; // weight by population fraction
  
  // aggregate pop-weighted blocks by zip code
  vector[n_zip] x;
  for (i in 1:n_zip) {
    int J = zip_target_obs[i];  // number of blocks in zipcode i
    vector[J] x_temp = x_target_w[zip_target[i]];
    x[i] = sum(x_temp);
  }

But I am getting parsing errors about variable definition and dimension mismatches rstan::stan_model:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector variable definition has base type realVariable definition dimensions mismatch, definition specifies 1, declaration specifies 0 error in 'modelba6e55e7cb78_spatial_range' at line 56, column 48
  -------------------------------------------------
    54:   for (i in 1:n_zip) {
    55:     int J = zip_target_obs[i];  // number of blocks in zipcode i
    56:     vector[J] x_temp = x_target_w[zip_target[i]];
                                                       ^
    57:     x[i] = sum(x_temp);
  -------------------------------------------------

I suspect I am doing something silly and there is a simple solution, but I have not yet been able to work it out myself. Thanks!

andrjohns · October 1, 2020, 12:32am

Hi David,

The error message:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable definition base type mismatch, variable declared as base type vector variable definition has base type real

Implies that you’re trying to assign a real value to the entirety of a vector.

If you break down the loop a bit, you can see where this is going wrong:

  for (i in 1:n_zip) {
    // Select i-th zip code
    int J = zip_target_obs[i];

    // Create vector the same size as the selected zip-code 
    vector[J] x_temp;

    // Assign the weight for that zip code to the vector
    x_temp = x_target_w[zip_target[i]];

    // Sum the vector for that zip code
    x[i] = sum(x_temp);
  }

It’s not really clear why you have the vector[J] and sum() steps there, as you’re only working with a single scalar value for each zip code

dholcomb · October 1, 2020, 12:48pm

Hi Andrew,

Thanks for the explanation. My issue is that I intended to be working with a vector of values for each zip code, summing them up to produce a scalar summary value. The vector will be different lengths for each zip code. x_target_w is an n_target length vector (n_target is much longer than n_zip) containing weighted values at the target locations, which I want to sum up by zip code. zip_target is an n_target length integer array holding the zip code ID for each element of x_target_w. Each zip code has a different number of corresponding elements in x_target_w, so zip_target_obs is a length n_zip integer array with the number of elements in x_target_w for each zip code. What I am trying to do in the loop is pull out the vector of values in x_target_w corresponding to the current zip code and sum them. I just realized the issue is that x_temp = x_target_w[zip_target[i]] just selects the ith element of zip_target, when what I want to do is select all the elements in zip_target that equal i, using those to index x_target_w. I don’t beleive conditional indexing is possible in stan, so any advice on work arounds would be much appreciated.

paupak4 · January 5, 2021, 6:45pm

Hi! I have a similar issue, where you able to figure out how to do the dplyr equivalent in Stan? Thanks!

Topic		Replies	Views
A matrix of parameters in Stan Integral Modeling fitting-issues	41	1422	November 25, 2019
Intermediate quantities Modeling	3	713	August 22, 2017
Clarification regarding Jacobians - BYM2 model Modeling	5	274	October 26, 2023
Expanding the ICAR model with smoothing component to a spatiotemporal framework Modeling rstan	4	527	March 9, 2023
Rejecting Initial Value (Student_t distribution) Modeling	3	364	June 20, 2018

Aggregating intermediate quantities by index variable

Related Topics