Categorical model divergences/priors

jfhawkin · May 11, 2022, 6:22pm

time_cost_calculated.CSV (1.9 MB)
mtc_equity-v4.R (2.2 KB)

I’ve been working on this model for a few months to get it running in a reasonable time. I now have it running but am having some divergence issues. I’m running it again with a higher adapt_delta (0.9 vs. 0.8) and additional warmup iterations, but I know that’s not the preferred solution. The model is being run on a GPU with 9 threads per chain, so I wondered if having so many independent threads might be affecting the results. I currently use the default init values.

I was also thinking it might be helpful to check my prior predictive distribution, but I haven’t done that before. This is a simplified model from a version that will have additional contextual variables in a[n] that describe person attributes. The hope is to also introduce correlation between some parameters.

I’d adjusted my priors based on a previous run, but they are clearly not exploring the full parameter space. I tried some visual diagnostics.

My parameter plot looks quite different from the examples. There seem to be a few tight regions explored across iterations.

My standard deviations are also clustered rather than having the normal distribution found in the online examples. Either they are truly multi-modal or my model/priors are wrong.

Same pattern for the beta parameters.

functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, data int[] J_2, matrix r_1, matrix r_2, 
                     real b_a2, real b_a3, real b_a4, real b_btc, real b_b1tt, real b_b2tt, real b_b3tt, real b_b4tt) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
      
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array([b_btc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,r_1[J_1[nn],1] + r_2[J_2[nn],1], r_1[J_1[nn],2] + r_2[J_2[nn],2], r_1[J_1[nn],3] + r_2[J_2[nn],3]];
      // add to linear predictor term
      // Terms are btc, b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[nn],4] + r_2[J_2[nn],4], r_1[J_1[nn],5] + r_2[J_2[nn],5], r_1[J_1[nn],6] + r_2[J_2[nn],6],r_1[J_1[nn],7] + r_2[J_2[nn],7],r_1[J_1[nn],8] + r_2[J_2[nn],8]];
      b[n] = exp(b[n]);
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = [b[n,1] * b[n,2:5], rep_row_vector(b[n,1],ncat)];
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  array[N] int<lower=1> J_2;  // grouping indicator per observation
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real b_btc;  // population-level effects
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  row_vector<lower=0>[8] sd1;  // group-level standard deviations
  row_vector<lower=0>[8] sd2;  // group-level standard deviations
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1;
  matrix<multiplier = rep_matrix(sd2, N_2)>[N_2, 8] r_2;
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1, J_2,
		     r_1, r_2, b_a2, b_a3, b_a4, b_btc, b_b1tt, b_b2tt, b_b3tt, b_b4tt);

  }
  // priors
  target += normal_lpdf(b_btc | 0, 5);
  target += normal_lpdf(b_a2 | 0, 25);
  target += normal_lpdf(b_a3 | 0, 25);
  target += normal_lpdf(b_a4 | 0, 25);
  target += normal_lpdf(b_b1tt | 0, 5);
  target += normal_lpdf(b_b2tt | 0, 2.5);
  target += normal_lpdf(b_b3tt | 0, 1.5);
  target += normal_lpdf(b_b4tt | 0, 5);
  target += student_t_lpdf(sd1 | 3, 0, 2.5);
  target += student_t_lpdf(sd2 | 3, 0, 2.5);
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]);
    r_2[, k] ~ normal(0, sd2[k]);
  }
}
}

jfhawkin · May 16, 2022, 2:14pm

I had to restart it because my slurm job timed out. I’m hoping to use that to do some additional diagnostics on the priors and model.

In the meantime, @stevebronder suggested in another of my threads on this model (sorry for the multiple threads, but it helps to keep issues separate as I work through getting the model running) that I switch from cmdstanr to cmdstan. I’m running the cmdstanr version with additional warmups and a higher adapt_delta.

Using cmdstanr, I was getting some messages during warmup that I ignored based on this thread. I get the same warnings running on cmdstan, just more of them (~3000 with cmdstan vs. ~100 with cmdstanr) and it ultimately ends with the below messages. I’m using the same starting seed and setup, as far as I can tell.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: normal_lpdf(OpenCL): Random variable[-1076144195, 1846113984] = 0, but it must be not NaN! (in '../mtc_equity_7.stan', line 115, column 4 to column 33)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: offset_multiplier_constrain: multiplier[1, 1] is -nan, but must be positive finite! (in '../mtc_equity_7.stan', line 91, column 2 to column 56)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: offset_multiplier_constrain: multiplier[1, 1] is -nan, but must be positive finite! (in '../mtc_equity_7.stan', line 91, column 2 to column 56)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: offset_multiplier_constrain: multiplier[1, 1] is -nan, but must be positive finite! (in '../mtc_equity_7.stan', line 91, column 2 to column 56)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: offset_multiplier_constrain: multiplier[1, 1] is -nan, but must be positive finite! (in '../mtc_equity_7.stan', line 91, column 2 to column 56)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: copy (OpenCL)->Eigen: clWaitForEvents CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST: Unknown error -14 (in '../mtc_equity_7.stan', line 114, column 4 to column 33)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

terminate called without an active exception
terminate called recursively
[c2420:1401362] *** Process received signal ***
/var/spool/slurmd/job41230661/slurm_script: line 28: 1401362 Aborted                 (core dumped) ./mtc_equity_7 sample num_chains=4 data file=mtc_data.json output file=output.csv num_threads=36 random seed=76543

The noted code blocks are:

stan
  row_vector<lower=0>[8] sd1;  // group-level standard deviations
  row_vector<lower=0>[8] sd2;  // group-level standard deviations
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1; // ln 91
  matrix<multiplier = rep_matrix(sd2, N_2)>[N_2, 8] r_2;

stan
  target += student_t_lpdf(sd1 | 3, 0, 2.5);
  target += student_t_lpdf(sd2 | 3, 0, 2.5);
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]); // ln 114
    r_2[, k] ~ normal(0, sd2[k]);
  }

jfhawkin · May 23, 2022, 4:45pm

I am still working through this problem. I simplified the model to remove some sd terms for the intercepts, which were large and not necessary for the model. I also noticed I was applying a parameter to two categories with zero values across observations (zero values are correct but no need to apply the parameter to them).

The updated code (with changes noted in comments) is shown below. I set adapt_delta=0.9 and increased warmup iterations from 1000 to 2000. I get 12% divergences and 88% transitions hitting max tree depth.

Plots updated from initial posting are given below. I still get clumped parameter distributions. Should I be changing from normal priors? I have a more complex model that includes additional variables at the higher hierarchical level (contextual variables). I have that model running but 100% of the transitions have divergences. All of this is using cmdstanr. I still can’t get cmdstan to run without errors. I’m still working on the cmdstan problem (I also noticed cmdstan takes much longer to compile vs cmdstandr models). The plus side of everything is that my model now runs in 35 hours with my full dataset!

I’ve included both model Stan files and the R file (csv data unchanged and could not upload json file for use with cmdstan).
mtc_equity-v4.R (2.2 KB)
mtc_equity_complex.stan (7.8 KB)
mtc_equity_simple.stan (4.5 KB)

stan
functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, data int[] J_2, matrix r_1, matrix r_2, 
                     real b_a2, real b_a3, real b_a4, real b_btc, real b_b1tt, real b_b2tt, real b_b3tt, real b_b4tt) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
      
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array([b_btc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,r_1[J_1[nn],1], r_1[J_1[nn],2], r_1[J_1[nn],3]];
      // add to linear predictor term
      // Terms are btc, b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[nn],4] + r_2[J_2[nn],1], r_1[J_1[nn],5] + r_2[J_2[nn],2], r_1[J_1[nn],6] + r_2[J_2[nn],3],r_1[J_1[nn],7] + r_2[J_2[nn],4],r_1[J_1[nn],8] + r_2[J_2[nn],5]];
      b[n] = exp(b[n]);
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = [b[n,1] * b[n,2:5], [b[n,1],0,0,b[n,1]]]; // CHANGE: Don't apply cost parameter to categories 2 and 3
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  array[N] int<lower=1> J_2;  // grouping indicator per observation
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real b_btc;  // population-level effects
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  row_vector<lower=0>[8] sd1;  // group-level standard deviations
  row_vector<lower=0>[5] sd2;  // group-level standard deviations // CHANGE: Remove 3 sd parameters for intercepts
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1;
  matrix<multiplier = rep_matrix(sd2, N_2)>[N_2, 5] r_2;
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1, J_2,
		     r_1, r_2, b_a2, b_a3, b_a4, b_btc, b_b1tt, b_b2tt, b_b3tt, b_b4tt);

  }
  // priors
  target += normal_lpdf(b_btc | 0, 5);
  target += normal_lpdf(b_a2 | 0, 25);
  target += normal_lpdf(b_a3 | 0, 25);
  target += normal_lpdf(b_a4 | 0, 25);
  target += normal_lpdf(b_b1tt | 0, 5);
  target += normal_lpdf(b_b2tt | 0, 2.5);
  target += normal_lpdf(b_b3tt | 0, 1.5);
  target += normal_lpdf(b_b4tt | 0, 5);
  target += student_t_lpdf(sd1 | 3, 0, 2.5);
  target += student_t_lpdf(sd2 | 3, 0, 2.5);
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]);
    if (k<6){
      r_2[, k] ~ normal(0, sd2[k]); // CHANGE: Remove 3 sd parameters for intercepts
    }
  }
}

jfhawkin · May 23, 2022, 7:58pm

I’m not sure if this helps diagnose the problem or adds confusion. I can run my full model [mtc_equity_complex.stan) with cmdstanr and it gives me:

Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: normal_lpdf(OpenCL): Scale parameter = 0, but it must be positive! (in '/tmp/RtmpKndH5b/model-34e20c24215f9f.stan', line 178, column 4 to column 35)

It runs to completion (albeit with many divergences). If I run the same model with cmdstan, it gives me a different set of messages an error after just a few iterations.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: student_t_lpdf(OpenCL): Random variable[-1064435709, 645135378] = -960, but it must be not NaN! (in '../mtc_equity_9.stan', line 167, column 2 to column 55)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: offset_multiplier_constrain: multiplier[1, 1] is -nan, but must be positive finite! (in '../mtc_equity_9.stan', line 130, column 2 to column 56)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

terminate called without an active exception

jfhawkin · June 1, 2022, 12:40am

I’ve run some additional tests to understand what’s happening with my model. I tried changes adapt_delta and max_treedepth, but those changes don’t help much (and take much longer to run). My last strategy was to run some synthetic data tests and modify my model a bit. I’ll describe the simplest version I’m working on, which includes many of the common issues and should be extensible to the model I’d like to estimate.

The model is a 4 category categorical/logit model. I generate 1000 individuals with 3 replications and 5 groups for the hierarchical specification. The logit function is given by
\begin{align} \frac{e^V_i}{{\sum_j e^V_j}} \end{align}

Each V_i is then given by (for alternative i and individual n)

\begin{align} V_{in} = -\beta_{cost,n} \times cost_{in} + \beta_{cost,n} \times (\beta_{time,in} \times time_{in}) + \beta_{cost,n} \times (\alpha_{in}) + \epsilon_{in} \end{align}

This is a specification defined by Hess and Train (2017) to convert a standard utility model into willingness-to-pay (WTP) space. \beta_{cost,n} is common across attributes and captures scale heterogeneity and \alpha_{in} is a category-specific hierarchical intercept.

\beta_{cost,n} is given a lognormal distribution to recognize that microeconomic theory assumes negative utility for cost (interestingly Daniel McFadden, who originated much of this work, just published an article on the underlying theory for why a lognormal distribution is appropriate). Time (travel time) is generally assumed negative, so I also use a negative lognormal transformation. The \beta_{cost,n} parameter is a combination of a scale term \sigma_n and a price term \beta_{cost,n}, which cannot be separately identified.

My model has two non-nested hierarchical structures: 1) by trip purpose (a few categories) and 2) by respondent (many categories). I’ve specified two tests, each containing only one of the two hierarchical parameter sets.

Model 1 had 9 divergent transitions and none hitting the max treedepth. The plots look fairly reasonable to me except for one parameter, b_bc. It’s also doing a good job matching most parameters to their synthetic input values. The rhat also looks good other than that one parameter.

Model 2 (hierarchical by individual) runs with 96% divergences and 4% of transitions hitting the max treedepth. The parameter distribution still has some big peaks and sd parameter values (i.e., 10^7 range for a couple). I’m attaching the synthetic data and Stan code for this model. The RData file is too large to upload so I’ve added the first rows from the summary. I was reading Underdetermined Linear Regression and Michael Betancourt’s discussion about undetermined models. My model will definitely have this problem because there are many more individual parameters than data points when I estimate the full model or even just the model with individual hierarchical effects (1000 \times 4 = 4000 individual parameters)
synthetic_test-v1.R (4.0 KB)
mtc_equity_7-v1.stan (4.1 KB)

variable	mean	median	sd	mad	q5	q95	rhat	ess_bulk
1 lp__	-1.54e+4	-1.58e+4	652.	139.	-1.59e4	-1.42e+4	2.11	5.24
2 b_btc	-9.22e+0	-2.89e+0	19.0	6.17	-4.99e1	8.95e+0	3.50	4.40
3 b_a2	-1.02e+0	-1.17e+0	1.07	0.940	-2.89e0	6.43e-1	2.02	5.49
4 b_a3	-2.58e+0	-2.14e+0	1.40	1.75	-4.44e0	-7.96e-1	1.83	5.85
5 b_a4	5.37e+0	5.59e+0	1.45	1.69	2.93e0	7.25e+0	2.06	5.34
6 b_bc	-1.34e+1	-1.35e+1	1.11	0.967	-1.51e1	-1.15e+1	3.00	4.57
7 b_b1tt	-3.96e+0	-3.80e+0	1.25	0.523	-5.93e0	-1.86e+0	2.44	4.89
8 b_b2tt	2.56e+0	3.35e-2	5.03	1.75	-1.79e0	1.13e+1	2.33	5.08
9 b_b3tt	-9.88e-1	-8.84e-1	1.18	1.66	-2.64e0	7.21e-1	1.91	5.66
10 b_b4tt	-4.45e+0	-4.94e+0	1.76	1.15	-6.53e0	-1.14e+0	2.82	4.64

For those who don’t want to open files… the Stan code is

stan
functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, matrix r_1, 
                     real b_a2, real b_a3, real b_a4, real b_bc, real b_b1tt, real b_b2tt, real b_b3tt, real b_b4tt) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
      
    // initialize linear predictor term
    // Terms are btc (median 1), b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array([b_bc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,r_1[J_1[nn],1], r_1[J_1[nn],2], r_1[J_1[nn],3]];
      // add to linear predictor term
      // Terms are btc (0), b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[nn],4], r_1[J_1[nn],5], r_1[J_1[nn],6],r_1[J_1[nn],7],r_1[J_1[nn],8]];
      b[n] = exp(b[n]); // lognormal because cost and WTP are single signed to match microeconomic theory
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = -1*[b[n,1] * b[n,2:5], [b[n,1],0,0,b[n,1]]];
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real b_btc;  // population-level effects
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_bc;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  row_vector<lower=0>[8] sd1;  // group-level standard deviations
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1;
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1,
		     r_1, b_a2, b_a3, b_a4, b_bc, b_b1tt, b_b2tt, b_b3tt, b_b4tt);

  }
  // priors
  target += normal_lpdf(b_a2 | -2, 1);
  target += normal_lpdf(b_a3 | -3.5, 1);
  target += normal_lpdf(b_a4 | 4, 1);
  target += normal_lpdf(b_bc | -0.5, 1);
  target += normal_lpdf(b_b1tt | -1, 1);
  target += normal_lpdf(b_b2tt | -0.3, 1);
  target += normal_lpdf(b_b3tt | -0.5, 1);
  target += normal_lpdf(b_b4tt | -1.7, 1);
  target += student_t_lpdf(sd1 | 3, 0, 1);
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]);
  }
}

jfhawkin · June 3, 2022, 1:44pm

Following the recommendations on the Stan warning page, I tried isolating parameters by moving some to the data block. I ran 3 additional models on the synthetic data (note: even for the 3000 observation synthetic dataset, the model takes ~16-20 hours to run, so still a very slow testing process). These models all include both non-nested hierarchies (by individual and by purpose). I exclude the b_as (intercepts), b_bc (cost), and b_t (time) parameters, in turn.

All 3 model still have a large fraction of divergent transitions and max_treepdepth results (running with adapt_delta and max_treedepth at their defaults. I also get fairly “spikey” parameter results. Some of the parameters reproduce the synthetic dataset values, but I don’t see any clear patterns (i.e., removing a group of parameters doesn’t give a clear improvement in results).

I also tried wider priors on all parameters, which ok (not great) for the parameter distribution but doesn’t recover parameters as well as the restricted models. I have another model running with the std. dev. fixed by moving them to the data block.

I can try changing initial values, priors, and adapt_delta/max_treedepth (based on recommendations on Stan warnings page). I have found that increasing max_treedepth from 10 to 12 significantly increases runtime (i.e., I’ve yet to have one finish before my HPC times out after 48 hours.

Current model (priors centered at their true values in many cases with tight priors):

stan
functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, data int[] J_2, matrix r_1, matrix r_2, 
                     real b_a2, real b_a3, real b_a4, real b_bc, real b_b1tt, real b_b2tt, real b_b3tt, real b_b4tt) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array([0,b_a2,b_a3,b_a4], N);
      
    // initialize linear predictor term
    // Terms are btc (median 1), b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array([b_bc,b_b1tt,b_b2tt,b_b3tt,b_b4tt], N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,r_1[J_1[nn],1], r_1[J_1[nn],2], r_1[J_1[nn],3]];
      // add to linear predictor term
      // Terms are btc (0), b1tt, b2tt, b3tt, b4tt
      b[n] += [r_1[J_1[nn],4] + r_2[J_2[nn],1], r_1[J_1[nn],5] + r_2[J_2[nn],2], r_1[J_1[nn],6] + r_2[J_2[nn],3],r_1[J_1[nn],7] + r_2[J_2[nn],4],r_1[J_1[nn],8] + r_2[J_2[nn],5]];
      b[n] = exp(b[n]); // lognormal because cost and WTP are single signed to match microeconomic theory
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = -1*[b[n,1] * b[n,2:5], [b[n,1],0,0,b[n,1]]];
      // Our alphas will be the hierarchical intercept parameters
      alpha = [a[n,1], b[n,1] * a[n,2], b[n,1] * a[n,3], b[n,1] * a[n,4]]';
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  array[N] int<lower=1> J_2;  // grouping indicator per observation
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real b_a2;  // population-level effects
  real b_a3;  // population-level effects
  real b_a4;  // population-level effects
  real b_bc;  // population-level effects
  real b_b1tt;  // population-level effects
  real b_b2tt;  // population-level effects
  real b_b3tt;  // population-level effects
  real b_b4tt;  // population-level effects
  row_vector<lower=0>[8] sd1;  // group-level standard deviations
  row_vector<lower=0>[5] sd2;  // group-level standard deviations
  matrix<multiplier = rep_matrix(sd1, N_1)>[N_1, 8] r_1;
  matrix<multiplier = rep_matrix(sd2, N_2)>[N_2, 5] r_2;
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1, J_2,
		     r_1, r_2, b_a2, b_a3, b_a4, b_bc, b_b1tt, b_b2tt, b_b3tt, b_b4tt);

  }
  // priors
  target += normal_lpdf(b_a2 | -2, 1);
  target += normal_lpdf(b_a3 | -3.5, 1);
  target += normal_lpdf(b_a4 | 4, 1);
  target += normal_lpdf(b_bc | -0.5, 1);
  target += normal_lpdf(b_b1tt | -1, 1);
  target += normal_lpdf(b_b2tt | -0.3, 1);
  target += normal_lpdf(b_b3tt | -0.5, 1);
  target += normal_lpdf(b_b4tt | -1.7, 1);
  target += student_t_lpdf(sd1 | 3, 0, 1);
  target += student_t_lpdf(sd2 | 3, 0, 1);
  for (k in 1:8) {
    r_1[, k] ~ normal(0, sd1[k]);
    if (k<6){
      r_2[, k] ~ normal(0, sd2[k]);
    }
  }
}

jfhawkin · June 6, 2022, 1:21pm

I’ve now tried changing the priors and adapt_delta to minimal effect. I reviewed my model specification against the initial model I generated in brms a while back. It looks ok to me. I’m running an even simpler version of the model with my real data where I use a hierarchy for the intercepts on purpose (6 purposes) and a hierarchy on the slopes by individual (i.e., each variable has only one hierarchical parameter). I’m hoping I can get this working and add group variables/correlations later.

The synthetic data model RDS are too large to upload. I’ve moved one of the synthetic data models (with its RDS, R, and Stan files) to a Dropbox folder. This model still has the original hierarchical specification (i.e., not simplified). I’d appreciate any suggestions.

jfhawkin · June 8, 2022, 3:09am

My simplified model is giving slightly more reasonable, albeit still not right, results.

I think one of the reasons for the large sd2’s (which are added to b_a2, b_a3, and b_a4) is that they are counteracting the b_bc, b_b1_tt, b2_tt, b3_tt, and b4_tt (which have sd1’s added to them), which have an exp transformation to give me a LN() rather than N() distribution on them.

I’m running it again with adapt_delta=0.95 (up from 0.90). I’m still confused why it runs with cmdstanr but not cmdstan. Both give a series of messagesException: offset_multiplier_constrain: multiplier[1, 1] is 0, but must be positive finite!, but cmdstanr continues running and stops giving the messages while cmdstan stops with

terminate called without an active exception
[c3208:1143527] *** Process received signal ***
[c3208:1143527] Signal: Aborted (6)
[c3208:1143527] Signal code:  (-6)

I’m testing it by running the cmdstan version with startup values from the cmdstanr run (thanks to this thread for the help). I changed my priors a bit and am trying to give it a metric and/or init json input. I can’t seem to get the parameters in the right order for cmdstan though.

./mtc_equity_8 sample num_warmup=1000 num_chains=4 num_threads=32 \
	    init=test_inits.json \
           data file=mtc_data_V3.json \
           output file=output_${i}.csv refresh=100 \
           adapt delta=0.9 \
	     random seed=76543

or

./mtc_equity_8 sample num_warmup=1000 num_chains=4 num_threads=32 \
           data file=mtc_data_V3.json \
           output file=output_${i}.csv refresh=100 \
           metric=diag_e metric_file=my_metric1.json \
           adapt delta=0.9 \
           algorithm=hmc engine=nuts max_depth=10 \
	     random seed=76543

or

./mtc_equity_8 sample num_warmup=1000 num_chains=4 num_threads=32 \
           data file=mtc_data_V3.json \
           output file=output_${i}.csv refresh=100 \
           adapt delta=0.9 \
	     random seed=76543

mitzimorris · June 8, 2022, 11:39am

agreed that CmdStan’s argument parser is a world of pain and makes the Unix tar look easy - cf xkcd: tar

try this:

./mtc_equity_8 sample  num_warmup=1000  adapt delta=0.9 \
    data file=mtc_data_V3.json \
    output file=output_${i}.csv refresh=100 \
    random seed=76543 \
    num_threads=4

when I need to check the CmdStan argument ordering, I run a model and then look at the first 50 or so lines of the output.csv header. this reflects the way that CmdStan arg parser nests its arguments - it will look something like this:

# method = sample (Default)
#   sample
#     num_samples = 1000 (Default)
#     num_warmup = 1000 (Default)
#     save_warmup = 0 (Default)
#     thin = 1 (Default)
#     adapt
#       engaged = 1 (Default)
#       gamma = 0.050000000000000003 (Default)
#       delta = 0.80000000000000004 (Default)
#       kappa = 0.75 (Default)
#       t0 = 10 (Default)
#       init_buffer = 75 (Default)
#       term_buffer = 50 (Default)
#       window = 25 (Default)
#     algorithm = hmc (Default)
#       hmc
#         engine = nuts (Default)
#           nuts
#             max_depth = 10 (Default)
#         metric = diag_e (Default)
#         metric_file =  (Default)
#         stepsize = 1 (Default)
#         stepsize_jitter = 0 (Default)
#     num_chains = 1 (Default)
# id = 2
# data
#   file = test/data/eight_schools.data.R
# init = 2 (Default)
# random
#   seed = 55157
# output
#   file = /Users/mitzi/github/stan-dev/cmdstanpy/eight_schools-20220608073226_2.csv
#   diagnostic_file =  (Default)
#   refresh = 100 (Default)
#   sig_figs = -1 (Default)
#   profile_file = profile.csv (Default)
# num_threads = 1 (Default)

jfhawkin · June 16, 2022, 2:49pm

Thanks @mitzimorris!
I’ve made a few changes to my model and generated some additional diagnostics (but no solution). I’ll break it out for easier reading.

I changed the way I specify lognormal parameters in the model based on the specification by @Bob_Carpenter in a developer discussion here. You have to separate the non-centered parameterizing using offset/multiplier syntax from the lower limit constraint by putting the constraint in the transformed parameter block. I then pass those transformed parameters to my likelihood function.
I found out the data from my collaborators is a mess, so I had to redo the processing on it. I introduce a variable called avail which says whether a category is available for an observation. The standard way to include availability is to multiply the loglikelihood by 0/1. This is equivalent to adding log(0)/log(1) in the exp() term (i.e., exp(log(0))=0), which was simpler to implement in my Stan code.

functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, data int[] J_2, 
                     vector b_a2, vector b_a3, vector b_a4, vector b_bc, vector b_b1tt, vector b_b2tt, vector b_b3tt, vector b_b4tt, data int[,] avail) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array(rep_row_vector(0,ncat), N);
      
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array(rep_row_vector(0,ntotvar), N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,b_a2[J_2[nn]], b_a3[J_2[nn]], b_a4[J_2[nn]]];
      // add to linear predictor term
      // Terms are btc (0), b1tt, b2tt, b3tt, b4tt
      b[n] += [b_bc[J_1[nn]], b_b1tt[J_1[nn]], b_b2tt[J_1[nn]], b_b3tt[J_1[nn]], b_b4tt[J_1[nn]]];
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
      beta = -1*[b[n,1] * b[n,2:5], [b[n,1],0,0,b[n,1]]];
      // Our alphas will be the hierarchical intercept parameters. ln(avail[nn,i]] gives an availability measure
      alpha = [a[n,1]+log(avail[nn,1]), b[n,1] * a[n,2]+log(avail[nn,2]), b[n,1] * a[n,3]+log(avail[nn,3]), b[n,1] * a[n,4]+log(avail[nn,4])]';
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  array[N] int<lower=1> J_2;  // grouping indicator per observation
  array[N,ncat] int<lower=0,upper=1> avail; // whether an alternative is available
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real mu_b1tt; // population-level effects
  real mu_b2tt; // population-level effects
  real mu_b3tt; // population-level effects
  real mu_b4tt; // population-level effects
  real mu_a2; // population-level effects
  real mu_a3; // population-level effects
  real mu_a4; // population-level effects
  real<lower=0> sd1_bc;  // group-level standard deviations
  real<lower=0> sd1_b1tt;  // group-level standard deviations
  real<lower=0> sd1_b2tt;  // group-level standard deviations
  real<lower=0> sd1_b3tt;  // group-level standard deviations
  real<lower=0> sd1_b4tt;  // group-level standard deviations
  real<lower=0> sd2_a2;  // group-level standard deviations
  real<lower=0> sd2_a3;  // group-level standard deviations
  real<lower=0> sd2_a4;  // group-level standard deviations
  vector<offset=mu_bc, multiplier = sd1_bc>[N_1] log_b_bc;
  vector<offset=mu_b1tt, multiplier = sd1_b1tt>[N_1] log_b_b1tt;
  vector<offset=mu_b2tt, multiplier = sd1_b2tt>[N_1] log_b_b2tt;
  vector<offset=mu_b3tt, multiplier = sd1_b3tt>[N_1] log_b_b3tt;
  vector<offset=mu_b4tt, multiplier = sd1_b4tt>[N_1] log_b_b4tt;
  vector<offset=mu_a2, multiplier = sd2_a2>[N_2] b_a2;
  vector<offset=mu_a3, multiplier = sd2_a3>[N_2] b_a3;
  vector<offset=mu_a4, multiplier = sd2_a4>[N_2] b_a4;
}
transformed parameters{
	vector<lower=0>[N_1] b_bc = exp(log_b_bc);
	vector<lower=0>[N_1] b_b1tt = exp(log_b_b1tt);
	vector<lower=0>[N_1] b_b2tt = exp(log_b_b2tt);
	vector<lower=0>[N_1] b_b3tt = exp(log_b_b3tt);
	vector<lower=0>[N_1] b_b4tt = exp(log_b_b4tt);
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1, J_2,
		        b_a2, b_a3, b_a4, b_bc, b_b1tt, b_b2tt, b_b3tt, b_b4tt, avail);

  }
  // priors - updated based on prior knowledge of general range (willingness-to-pay perhaps $5-$75 per hour)
  mu_b1tt ~ normal(3,1);
  mu_b2tt ~ normal(3,1);
  mu_b3tt ~ normal(3,1);
  mu_b4tt ~ normal(2,1);
  mu_a2 ~ normal(3,1);
  mu_a3 ~ normal(3,1);
  mu_a4 ~ normal(3,1);
  sd1_bc ~ std_normal();
  sd1_b1tt ~ normal(2,1);
  sd1_b2tt ~ normal(2,1);
  sd1_b3tt ~ normal(2,1);
  sd1_b4tt ~ normal(2,1);
  sd2_a2 ~ normal(1,5);
  sd2_a3 ~ normal(1,5);
  sd2_a4 ~ normal(1,5);

  b_a2 ~ normal(mu_a2, sd2_a2);
  b_a3 ~ normal(mu_a3, sd2_a3);
  b_a4 ~ normal(mu_a4, sd2_a4);
  // Mean must be zero (exp(0)=1 median) for identification (i.e., only estimate variance on scale/b_bc term)
  log_b_bc ~ normal(0, sd1_bc);
  log_b_b1tt ~ normal(mu_b1tt, sd1_b1tt);
  log_b_b2tt ~ normal(mu_b2tt, sd1_b2tt);
  log_b_b3tt ~ normal(mu_b3tt, sd1_b3tt);
  log_b_b4tt ~ normal(mu_b4tt, sd1_b4tt);
}

The parameter plots follow. I’m getting more clustering of the posterior within a single region but still get some odd histogram plots. I’m running the model again with standard normal priors because I think these parameters are well outside where they should be and it may be coming from using N(0,5) priors that are reasonable on the unconstrained scale but huge when log transformed (e.g., exp(5) = 148)

divergence_plot_11117×750 77 KB

mu_plot_11117×750 9.34 KB

sigma_plot_11117×750 9.2 KB
I’ve run a simple model without hierarchy using MLE optimization using a different package and get reasonable parameter estimates. I tried running the Stan code using optimize(). I tried a few different priors/init specifications, but I consistently get the following error.

Exception: offset_multiplier_constrain: multiplier is 0, but must be positive finite! (in '/tmp/Rtmp1OXpkn/model-11fa68cd532a.stan', line 101, column 2 to column 64) 
Error evaluating model log probability: Non-finite gradient. 
Error evaluating model log probability: Non-finite gradient. 
Error evaluating model log probability: Non-finite gradient. 
Error evaluating model log probability: Non-finite gradient. 
Finished in  3.3 seconds.

I receive a similar error when running sample() with cmdstanr and cmdstan. The cmdstanr version runs to the end and the cmdstan code also fails to run. I’ve kept going with the model based on the developer discussion here. My model stops giving the error/information message after about 50 instances, so I thought it was solved within the first part of warmup.
5. I’ve been trying to run the cmdstan version with additional inputs of either init or metric from previous runs with cmdstanr to see if that gets it running (i.e., by providing a starting point that doesn’t generate “multiplier is 0” errors using cmdstanr). I wasn’t sure it was worth trying to get that running vs. continuing with cmdstanr, which appears to run fine. I even copied over the first lines from the csv output from cmdstan and used the full list of options. I get an error that an = is misplaced.

./mtc_equity_12 method = sample \
  sample \
    num_samples = 1000 \
    num_warmup = 1000 \
    save_warmup = 0 \
    thin = 1 \
    adapt \
      engaged = 1 \
      gamma = 0.050000000000000003 \
      delta = 0.9 \
      kappa = 0.75 \
      t0 = 10 \
      init_buffer = 75 \
      term_buffer = 50 \
      window = 25 \
    algorithm = hmc \
      hmc \
        engine = nuts \
          nuts \
            max_depth = 10 \
        metric = diag_e \
        metric_file = \
        stepsize = 1 \
        stepsize_jitter = 0 \
id = 0 \
data \
  file = mtc_data.json \
init = my_inits12.json \
random \
  seed = 76543 \
output \
  file = output_12_${i}.csv \
  refresh = 100 \
num_threads = 32 \
opencl \
  device = 0 \
  platform = 0 \
mpi_enabled = 1
stancflags = --use-opencl

jfhawkin · June 20, 2022, 2:59pm

My current model is given below with the same plots I’ve been generating to diagnose problems. I’m clearly stuck…

I’ve been running some tests using optimize(). I started with the simplest model - population parameters only. This converged, so I progressively added hierarchy back into the model. It runs with b_a1, b_a2, b_a3, b_a4, and b_bc being hierarchical parameters. When I add b_b1tt, etc., it gives a line search failure error. I tried only adding hierarchy for one of the b_b#tt and a few other combinations. In theory, the model should run (if specified as I believe it to be). I’m working off a published model with the same structure - i.e., all random/hierarchical parameters and a multiplication by a common hierarchical parameter (in my model b_bc), which captures scale heterogeneity/cost. You cannot separate the cost parameter from the scale and its median should be fixed to 1.0 for identification. Others also use lognormal priors for these parameters.

Current data and R script:
mtc_equity-v13.R (2.3 KB)
time_cost_calculated.CSV (2.4 MB)

functions {
  int[] sequence(int start, int end) { 
    int seq[end - start + 1];
    for (n in 1:num_elements(seq)) {
      seq[n] = n + start - 1;
    }
    return seq; 
  } 

  real partial_log_lik_lpmf(int[] seq, int start, int end,
		     data int ncat, data int nvar, data int ntotvar, data int[] Y, data matrix x1, data matrix x2, data matrix x3, data matrix x4, data int[] J_1, data int[] J_2, 
                     vector b_a2, vector b_a3, vector b_a4, vector b_bc, vector b_b1tt, vector b_b2tt, vector b_b3tt, vector b_b4tt, matrix avail) {
                     
    real ptarget = 0;
    int N = end - start + 1;
    // Define matrices/vector for x, alpha, beta
    matrix[ncat,nvar] x;
    vector[ncat] alpha;
    matrix[nvar,ncat] beta;

    array[N] row_vector[ncat] a = rep_array(rep_row_vector(0,ncat), N);
      
    // initialize linear predictor term
    // Terms are btc, b1tt, b2tt, b3tt, b4tt

    array[N] row_vector[ntotvar] b = rep_array(rep_row_vector(0,ntotvar), N);

    for (n in 1:N) {
      int nn = n + start - 1;
      // add to linear predictor term
      a[n] += [0,b_a2[J_2[nn]], b_a3[J_2[nn]], b_a4[J_2[nn]]];
      // add to linear predictor term
      // Terms are btc, b1tt, b2tt, b3tt, b4tt
     // Median btc parameter fixed to 1.0 for identification
      b[n] += [b_bc[J_1[nn]], b_b1tt[J_1[nn]], b_b2tt[J_1[nn]], b_b3tt[J_1[nn]], b_b4tt[J_1[nn]]];
      // Each x and beta is a matrix with dimensions alts x variables
      // Our x will be the time/cost coming in as inputs
      x = to_matrix([x1[nn],x2[nn],x3[nn],x4[nn]]);

      // Our betas will be the hierarchical slope parameters (2 x 4)
     // Time and cost parameters multiplied by -1 because all have negative effect (positive lognormal transformed to negative lognormal)
      beta = -1*[b[n,1] * b[n,2:5], [b[n,1],0,0,b[n,1]]];
      // Our alphas will be the hierarchical intercept parameters. ln(avail[nn,i]] gives an availability measure
      alpha = [a[n,1]+log(avail[nn,1]), b[n,1] * a[n,2]+log(avail[nn,2]), b[n,1] * a[n,3]+log(avail[nn,3]), b[n,1] * a[n,4]+log(avail[nn,4])]';
     // Y f(alpha+betaX)
      ptarget += categorical_logit_glm_lupmf(Y[nn] | x, alpha, beta);
    }
    return ptarget;
  }
}

data {
  int<lower=1> N;  // total number of observations
  int<lower=2> ncat;  // number of categories
  int<lower=2> nvar;  // number of variables for each alternative
  int<lower=2> ntotvar;  // number of total variables
  int grainsize;  // grainsize for threading
  array[N] int Y;  // response variable
  // covariate matrix for alt1
  matrix[N,nvar] x1; // matrix[N,nvar] x1;
  // covariate matrix for alt2 
  matrix[N,nvar] x2; // matrix[N,nvar] x2;
  // covariate matrix for alt3 
  matrix[N,nvar] x3; // matrix[N,nvar] x3;
  // covariate matrix for alt4
  matrix[N,nvar] x4; // matrix[N,nvar] x4;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  array[N] int<lower=1> J_1;  // grouping indicator per observation
  // group-level predictor values
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  array[N] int<lower=1> J_2;  // grouping indicator per observation
  matrix<lower=0,upper=1>[N,ncat] avail; // whether an alternative is available
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int seq[N] = sequence(1, N);
}
parameters {
  real mu_b1tt; // population-level effects
  real mu_b2tt; // population-level effects
  real mu_b3tt; // population-level effects
  real mu_b4tt; // population-level effects
  real mu_a2; // population-level effects
  real mu_a3; // population-level effects
  real mu_a4; // population-level effects
  real<lower=0> sd1_bc;  // group-level standard deviations
  real<lower=0> sd1_b1tt;  // group-level standard deviations
  real<lower=0> sd1_b2tt;  // group-level standard deviations
  real<lower=0> sd1_b3tt;  // group-level standard deviations
  real<lower=0> sd1_b4tt;  // group-level standard deviations
  real<lower=0> sd2_a2;  // group-level standard deviations
  real<lower=0> sd2_a3;  // group-level standard deviations
  real<lower=0> sd2_a4;  // group-level standard deviations
  vector<offset=0, multiplier = sd1_bc>[N_1] log_b_bc;
  vector<offset=mu_b1tt, multiplier = sd1_b1tt>[N_1] log_b_b1tt;
  vector<offset=mu_b2tt, multiplier = sd1_b2tt>[N_1] log_b_b2tt;
  vector<offset=mu_b3tt, multiplier = sd1_b3tt>[N_1] log_b_b3tt;
  vector<offset=mu_b4tt, multiplier = sd1_b4tt>[N_1] log_b_b4tt;
  vector<offset=mu_a2, multiplier = sd2_a2>[N_2] b_a2;
  vector<offset=mu_a3, multiplier = sd2_a3>[N_2] b_a3;
  vector<offset=mu_a4, multiplier = sd2_a4>[N_2] b_a4;
}
// Transform normal to lognormal (all time and cost parameters should be single signed)
transformed parameters{
	vector<lower=0>[N_1] b_bc = exp(log_b_bc);
	vector<lower=0>[N_1] b_b1tt = exp(log_b_b1tt);
	vector<lower=0>[N_1] b_b2tt = exp(log_b_b2tt);
	vector<lower=0>[N_1] b_b3tt = exp(log_b_b3tt);
	vector<lower=0>[N_1] b_b4tt = exp(log_b_b4tt);
}
model {
  // likelihood including constants
  if (!prior_only) {
       target += reduce_sum(partial_log_lik_lpmf, seq,
                     grainsize, ncat, nvar, ntotvar, Y, x1, x2, x3, x4, J_1, J_2,
		        b_a2, b_a3, b_a4, b_bc, b_b1tt, b_b2tt, b_b3tt, b_b4tt, avail);

  }
  // priors
  mu_b1tt ~ normal(3,1); // Gives reasonable values of time (~$5-$75)
  mu_b2tt ~ normal(3,1);
  mu_b3tt ~ normal(3,1);
  mu_b4tt ~ normal(2,1);
  mu_a2 ~ normal(3,1);
  mu_a3 ~ normal(3,1);
  mu_a4 ~ normal(3,1);
  sd1_bc ~ std_normal();
  sd1_b1tt ~ normal(2,1);
  sd1_b2tt ~ normal(2,1);
  sd1_b3tt ~ normal(2,1);
  sd1_b4tt ~ normal(2,1);
  sd2_a2 ~ normal(1,5);
  sd2_a3 ~ normal(1,5);
  sd2_a4 ~ normal(1,5);
  b_a2 ~ normal(mu_a2, sd2_a2);
  b_a3 ~ normal(mu_a3, sd2_a3);
  b_a4 ~ normal(mu_a4, sd2_a4);
  log_b_bc ~ normal(0, sd1_bc);
  log_b_b1tt ~ normal(mu_b1tt, sd1_b1tt);
  log_b_b2tt ~ normal(mu_b2tt, sd1_b2tt);
  log_b_b3tt ~ normal(mu_b3tt, sd1_b3tt);
  log_b_b4tt ~ normal(mu_b4tt, sd1_b4tt);
}

mu_b#tt in the range of 4.0 are ok (a bit high) when combined with an sd of about 0-3.

jfhawkin · June 27, 2022, 9:57pm

I have yet to obtain a working model, but I want to give a final summary on this topic because my questions are starting to move shift towards being model-specific rather than a generic question about divergences/priors.

I estimated several models last week that helped me to better understand the effect of different inputs. I read some additional suggestions on zero avoiding priors (zap). As recommended in a few places, I tried a gamma prior rather than lognormal. I switched to a student t prior on my scale parameter (I realize it’s not technically a standard deviation with the gamma, but I kept the naming convention from my previous model iterations), which seems to help a bit. I also tried combining two parameters that theoretically make sense to test as a single parameter, which didn’t help.

mu_b1tt ~ normal(3,0.25);
sd1_b1tt ~ student_t(3,0,1);
b_b1tt ~ gamma(mu_b1tt, sd1_b1tt);

I tried some tighter priors, additional iterations, a higher max_treedepth, and a higher adapt_delta. See below for how this changes my posterior plots.

Normal sd priors (25% divergent transitions)

student-t sd priors and adapt_delta=0.99 (up from 0.95) (11% divergent transitions)

gamma priors (~0% (3 total) divergent transitions)

Topic		Replies	Views
Divergent transitions Modeling	12	998	July 17, 2019
Divergences in a non-centered computational model Modeling fitting-issues	21	1421	October 30, 2019
Divergent transitions in hierarchical model Modeling fitting-issues	26	1988	November 7, 2019
High number of divergent transitions in ODE model with explicit constraints Modeling ecology , population-dynamics	55	3427	June 5, 2018
Stan do not use my init and divergent transition issue Modeling	9	1117	June 14, 2019

Categorical model divergences/priors

Related topics