Simple Linear Regression with Categorical Predictor (species)

Hi all,

I am trying to specify a multilevel model in which the first level estimates a species-specific slope and intercept for each species, and the second level uses those species-specific coefficients as data. As a result, the two levels have different numbers of data points. I am running into problems at the first level, where I estimate the species-specific coefficients; I think it has something to do with the way I specify the ‘species’ predictor.

To specify the first level, I followed this post: https://stackoverflow.com/questions/29183577/how-to-represent-a-categorical-predictor-rstan, which creates a data matrix with model.matrix(). The issue comes when I use loo() on the fitted model object: no matter what I do, all of my Pareto k values are “very bad”. I have pasted the code below:

data_matrix <- model.matrix(y1 ~ x * Species, data = df)

dat <- list(N = nrow(df),
            K = ncol(data_matrix),
            data_matrix = data_matrix,
            y1 = df$y1,
            y2 = unique(df$y2))


data {
      int N; // number of observations
      vector[N] y1; // response variable 1 
      vector[33] y2; // response variable 2
      int K; // number of predictors
      matrix[N, K] data_matrix; // matrix of predictors
      }

  parameters {
    
      // level 1
      vector[K] beta;
      real<lower=0> sigma;
      
      // level 2
      real y2_int;
      real y2_slope;
      real<lower=0> sigma_y2;

      }

  transformed parameters {

      vector[N]  mu_y1;
      real beta_y1_ref;
      vector[33] beta_y1;
      vector[33] mu_y2;
    
      mu_y1 = data_matrix * beta; 
      
      beta_y1_ref = beta[1];
      
      beta_y1[1] = beta_y1_ref;
      beta_y1[2:33] = beta_y1_ref + beta[3:34];
     
      mu_y2  =  y2_int + y2_slope * beta_y1;
      }

  model {

      beta            ~ student_t(3, 0, 10);
      sigma           ~ cauchy(0, 10);   
      
      y1 ~ normal(mu_y1, sigma);
      
      y2_int         ~ student_t(1, 0, 10);
      y2_slope       ~ student_t(1, 0, 10);
      sigma_y2       ~ cauchy(0, 10);
      
      y2 ~ normal(mu_y2, sigma_y2); 
      
      }


  generated quantities {

      vector[33] log_lik;

      for(i in 1:33){
      log_lik[i] =
      normal_lpdf(y1[i] | mu_y1, sigma) +
      normal_lpdf(y2[i] | mu_y2, sigma_y2);
      }
      }

When I run the above model, the parameter estimates for the first level match those from lm() or brms(). However, when I run loo(model_object), all of the Pareto k values come back “very bad”. This happens even when I run the model without the second level (the y2 ~ … part). I am also not sure whether I have specified log_lik correctly (previously I have only specified multilevel models with the same number of observations at every level).
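(For comparison, here is what a pointwise version of the generated quantities block could look like, one log_lik entry per observation rather than one combined entry per species, with each entry indexed into its own mean. This is only a sketch of that convention, not necessarily what the model intends:)

```stan
generated quantities {
  // One pointwise log-likelihood entry per observation: N entries for
  // y1 and 33 for y2, each evaluated at its own mean (note mu_y1[i]
  // and mu_y2[j], not the full vectors mu_y1 and mu_y2).
  vector[N + 33] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(y1[i] | mu_y1[i], sigma);
  for (j in 1:33)
    log_lik[N + j] = normal_lpdf(y2[j] | mu_y2[j], sigma_y2);
}
```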

Any help would be greatly appreciated. Thank you!

Can you post your model call and a snippet of your data or simulated data? That will help with the troubleshooting.

I’ve not dug through your model code, but bad Pareto K values from loo do not (necessarily) indicate that there’s any problem with your model. It just means that for your model loo cannot provide a reliable approximation of leave-one-out cross-validation (and therefore neither can waic or other common model selection diagnostics). This can arise due to problems with the model, and it means that your inference might be especially sensitive to model misspecification. However, it can also arise in well-specified models that yield reliable inference, in which case it probably reflects that some (or many) of your data points are highly influential.


Thanks @Ara_Winter and @jsocolar! I wonder if @paul.buerkner can answer this, as it may be a loo issue (although the problem surely lies with my code)?

I tried to troubleshoot a little more and found something interesting… I think the issue arises when I try to extract the coefficients. Here is a snippet of the data and model call, and the part of the Stan file where I think I am going wrong:

data_matrix <- model.matrix(y ~ x * Species, data = df)

dat <- list(N = nrow(df),
            y = df$y,
            K = ncol(data_matrix),
            data_matrix = data_matrix)

fit <- stan(file = "file.stan",
            data = dat,
            iter = 5000,
            warmup = 100,
            chains = 4,
            control = list("adapt_delta" = 0.99,
                           "max_treedepth" = 18),
            pars = c("beta",
                     "log_lik",
                     "sigma"))
  data {
      int N; // number of observations
      vector[N] y; // response 1 (respiratory surface area)
      int K; // number of predictors
      matrix[N, K] data_matrix; // matrix of predictors
      
      }

  parameters {
    
      // level 1
      vector[K] beta;
      real<lower=0> sigma;
      
      }

  transformed parameters {

      vector[N]  mu_y;
      real beta_int_ref;
      real beta_slope_ref;
      vector[K] beta_ints;
      vector[K] beta_slopes;

      mu_y = data_matrix * beta; 
      
      beta_int_ref = beta[1];
      beta_slope_ref = beta[2];
      
      beta_ints[1] = beta_int_ref;
      beta_ints[2:32] = beta_int_ref + beta[3:33];
      
      beta_slopes[1] = beta_slope_ref;
      beta_slopes[2:32] = beta_slope_ref + beta[34:64];
    
    
      }

  model {

      beta                ~ student_t(3, 0, 10);
      sigma               ~ cauchy(0, 10);   
      
      y ~ normal(mu_y, sigma);
      
      }


  generated quantities {

      vector[K] log_lik;

      for(i in 1:K){
      log_lik[i] =
      normal_lpdf(y[i] | mu_y, sigma);
      }}

When I run the model without the following:

      beta_int_ref = beta[1];
      beta_slope_ref = beta[2];
      
      beta_ints[1] = beta_int_ref;
      beta_ints[2:32] = beta_int_ref + beta[3:33];
      
      beta_slopes[1] = beta_slope_ref;
      beta_slopes[2:32] = beta_slope_ref + beta[34:64];

There is no problem when I run loo(model_object). However, if I include the above code to extract the coefficients and run loo(model_object), I get:

Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

Computed from 19600 by 64 log-likelihood matrix

           Estimate        SE
elpd_loo -9457839.5  608721.4
p_loo     4195262.3  268556.0
looic    18915678.9 1217442.7
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      0     0.0%   <NA>      
 (0.5, 0.7]   (ok)        0     0.0%   <NA>      
   (0.7, 1]   (bad)       0     0.0%   <NA>      
   (1, Inf)   (very bad) 64   100.0%   0         
See help('pareto-k-diagnostic') for details.

Without the code to extract coefficients:

Computed from 19600 by 64 log-likelihood matrix

        Estimate   SE
elpd_loo     52.7  7.6
p_loo        10.2  2.3
looic      -105.4 15.1
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     58    90.6%   2723      
 (0.5, 0.7]   (ok)        6     9.4%   101       
   (0.7, 1]   (bad)       0     0.0%   <NA>      
   (1, Inf)   (very bad)  0     0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details. 

Now, I need to extract the coefficients because my full model has another level, which uses them as data. The reason I set up the code this way was to ensure that the coefficient values were not relative to the first group…

On another note, I’m not sure I am specifying the likelihood correctly: should the length of log_lik be the number of groups (species), given that the way I have parameterized the model is essentially a separate regression for each species?
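(For reference, here is what a per-observation version would look like for the single-level model above, one entry per data point rather than per predictor, with each entry indexed into its own mean. Just a sketch of that convention:)

```stan
generated quantities {
  // One entry per observation (N of them, not K), each evaluated at
  // that observation's own mean mu_y[i] rather than the whole vector.
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(y[i] | mu_y[i], sigma);
}
```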

Any ideas?

Thanks!!
