Decomposing Regression Coefficient in Non-Linear Model

I’ve got a model that I’m trying to figure out how to fit in brms if possible. The general idea comes from a specific use case of a linear logistic test model from item response theory. To walk through this quickly, I’m going to attempt to write out the formulas (apologies if these don’t abide by appropriate standard notations, but hopefully they are at least accurate).

The linear logistic test model is usually treated as an extension of the one-parameter logistic model, which is shown on the logit scale as:

\eta_{ij} = \theta_j + \beta_i

The log-odds, \eta_{ij}, of person j answering item i correctly is a function of the person’s ability, \theta_j, and the item easiness, \beta_i. The linear logistic test model arises when the item easiness parameter is estimated from covariates such that,

\beta_i = \sum_{k=1}^{K}\omega_k X_{ik} + \delta_i

Here, the item easiness parameter is just a function of various item covariates, X_{ik}, that each have their own coefficients, \omega_k, and then each item can have its own residual, estimated by a random intercept, \delta_i.
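
Substituting the second equation into the first gives the combined form, with \theta_j mapping onto the (1 | id) term and \delta_i onto the (1 | item) term in the brms formula below:

\eta_{ij} = \theta_j + \sum_{k=1}^{K}\omega_k X_{ik} + \delta_i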

My understanding is that the linear logistic test model would be specified in brms using the following formula:

llt1pm <- brm(resp ~ cov1 + cov2 + cov3 + (1 | item) + (1 | id),
              data = df, family = bernoulli(link = "logit"))

Extending this to the two-parameter logistic model is straightforward with brms as well:

llt2pm <- brm(bf(resp ~ exp(logalpha)*theta + beta, nl = TRUE,
                 theta ~ 0 + (1 | id),
                 logalpha ~ 1 + cov1 + cov2 + cov3 + (1 | item),
                 beta ~ 1 + cov1 + cov2 + cov3 + (1 | item)),
              data = df, family = bernoulli(link = "logit"))

I’m interested in something akin to differential item functioning but at the level of the item covariates. So, the equation for a one-parameter model would be something like this:

\eta_{ijg} = \theta_j + \omega_g^{\theta} G + \omega_z^{\theta} Z_j + \delta_i + \omega_1 X_1 + \omega_2 X_2 + (\delta_j + \omega_g G + \omega_z Z_j) X_3

In effect, the item covariates (X_k) still operate as before, and each item retains its own random residual, \delta_i, but the coefficient for one of the item covariates (X_3 in this case) is permitted to vary over persons. Specifically, the interest is in the extent to which differences in the effect of that item covariate can be separated into an effect due to a person covariate, Z_j, an effect due to a group indicator, G, and a residual person-level term, \delta_j. Since the person-level covariates enter the estimation of the item covariate effect, the same variables are also included in the base model (with the \theta superscripts above) to account for potential between-person differences in \theta that may be due to the person covariate and group indicator. Also, the group indicator is not a dummy variable but is coded as -1 and 1, since the emphasis is on interpreting the effect in terms of the difference between the two groups.

I cannot seem to find a way to specify this model in brms. As before, it’s really the two-parameter extension of the formula that I’m interested in. My first thought was that the specification should look something like this:

llt2pm <- brm(bf(resp ~ exp(logalpha)*theta + beta, nl = TRUE,
                 theta ~ z + g + (1 | id),
                 logalpha ~ 1 + cov1 + cov2 + ba*cov3 + (1 | item),
                 beta ~ 1 + cov1 + cov2 + bb*cov3 + (1 | item),
                 ba ~ z + g + (1 | id),
                 bb ~ z + g + (1 | id)),
              data = df, family = bernoulli(link = "logit"))

This, of course, does not work because of the combined linear and non-linear components on the beta and logalpha parameters. I cannot, however, figure out how to properly specify the model so that brms will run it.

Does anyone know how to specify this model in brms? If it’s not possible, then I’m also open to other thoughts about fitting the model.

There might be an easier way to do this in brms, but something that should work is to define a custom family that takes distributional parameters theta, logalpha, beta, ba, and bb, and vreal component cov3. Let’s rename theta as mu. Then you could write something like (this is untested and quite possibly buggy, but it might get you on the right track):

bf(resp | vreal(cov3) ~ z + g + (1 | id), # this is the theta piece, but for brms we have to call something `mu`
   logalpha ~ 1 + cov1 + cov2 + (1 | item), # this is the logalpha piece except for the ba term
   beta ~ 1 + cov1 + cov2 + (1 | item), # the beta piece except for the bb term
   ba ~ z + g + (1 | id),
   bb ~ z + g + (1 | id))

In the definition of the custom family, we’d have something like

real p  = exp(logalpha[i] + ba[i] * vreal1[i]) * mu[i] + beta[i] + bb[i] * vreal1[i];
target += bernoulli_logit_lpmf(y[i] | p);
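
Pulling those pieces together, the Stan function supplied with the custom family might look roughly like this (the bias2pl name is just a placeholder, and this is as untested as the rest):

real bias2pl_lpmf(int y, real mu, real logalpha, real beta,
                  real ba, real bb, real vreal1) {
  // mu plays the role of theta; vreal1 carries cov3
  real eta = exp(logalpha + ba * vreal1) * mu + beta + bb * vreal1;
  return bernoulli_logit_lpmf(y | eta);
}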

Realistically, it would probably be easier to just write this model in pure Stan rather than to wrangle this custom family. The reason to do this in brms would be if you want the ability to rapidly and flexibly experiment with alternative specifications for the various linear predictors.


I figured that something outside the standard brms formulas would be needed in this case. I’m always surprised by the flexibility of the package. I suspect that you’re right that coding directly in Stan is probably the simpler option. I’d wanted to use brms mostly because (a) I’m not super confident in writing multilevel non-linear regressions and (b) I’d like to use the reduce_sum() function to get multithreading going for the model, which brms does automagically.

I know that there are quite a few examples of IRT models in Stan out there, so I’ll probably adapt code from those. Any recommendations about speed-up options to pursue? I’ve seen some talk about the _glm functions and the circumstances in which they may or may not save time. I’m sure I can get a clunky version of the model going, but it would help to know which directions to take the model for efficiency.
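
One concrete piece I know I’d need is the reduce_sum structure that brms would otherwise generate for me. My understanding from the user’s guide is that, for the Bernoulli likelihood I’m using, a hand-written version would look roughly like this (untested sketch; grainsize would be an int passed as data, and this uses the alpha * theta - beta parameterization from the model below):

functions {
  // partial log likelihood over a slice of observations
  real partial_ll(array[] int y_slice, int start, int end,
                  vector alpha, vector theta, vector beta,
                  array[] int ii, array[] int jj) {
    vector[end - start + 1] eta;
    for ( n in start : end )
      eta[n - start + 1] = alpha[ii[n]] * theta[jj[n]] - beta[ii[n]];
    return bernoulli_logit_lpmf(y_slice | eta);
  }
}

model {
  // slices y across threads; everything else is passed as shared arguments
  target += reduce_sum(partial_ll, y, grainsize,
                       alpha, theta, beta, ii, jj);
}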

I’ve transitioned from brms to cmdstanr and am just writing the model outright now. I’ve run into an issue with trying to pass along relevant prior information, though. So, I’m open to any recommendations for how to approach this.

For illustrative purposes, here’s a simplified model that still gets the idea across:

data {
  int<lower=1> I;                      // number of items
  int<lower=1> J;                      // number of persons
  int<lower=1> N;                      // number of observations
  array[N] int<lower=1, upper=I> ii;   // indicator for item on observation N
  array[N] int<lower=1, upper=J> jj;   // indicator for person on observation N
  array[N] int<lower=0, upper=1> y;    // indicator for response on observation N
  int<lower=1> K_b;                    // difficulty covariates
  matrix[I, K_b] W_b;                  // difficulty covariate matrix
  matrix[2, I] prior_logalpha;         // informative priors for log-scaled alpha (row 1 = means, row 2 = SDs)
}

parameters {
  vector[J] theta;                     // latent ability parameters
  vector[I] logalpha_c;                // centered log-scaled discrimination parameters
  vector[K_b] coef_b;                  // coefficients for item difficulty parameters
}

transformed parameters {
  // linear prediction of item difficulty
  vector[I] beta = W_b * coef_b;

  // rescale item discrimination parameters
  vector[I] alpha = exp(logalpha_c .* prior_logalpha[2]' + prior_logalpha[1]');
}

model {
  // some priors
  coef_b ~ std_normal();
  theta ~ std_normal();
  logalpha_c ~ std_normal();
  
  // likelihood
  {
    vector[N] eta = alpha[ii] .* theta[jj] - beta[ii];
    y ~ bernoulli_logit(eta);
  }
}

The problem that I’m running into, and which is hopefully illustrated in the above (untested) code, is that I have some prior information about the parameter values that should be included in the model. These priors are normal approximations using the mean and standard deviation of posteriors from a standard IRT model that just estimates the parameters outright, no explanatory covariates or anything.
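
For a single parameter, the rescaling in transformed parameters is just the non-centered version of placing that normal approximation on the parameter directly; the centered equivalent would look something like:

parameters {
  vector[I] logalpha;                  // log-scaled discrimination, sampled directly
}

model {
  // normal approximation from the earlier, covariate-free IRT fit
  logalpha ~ normal(prior_logalpha[1]', prior_logalpha[2]');
}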

The way I’m currently thinking about dealing with this is to specify the coefficients in the linear prediction of the parameters as a simplex, so that the coefficients are essentially proportions of the parameter of interest. [EDIT: After posting, it occurred to me that the simplex is strictly positive, so this will not work as intended. Maybe declaring a unit vector instead?] I think this could then be scaled something like the following:

data {

  ... other stuff ...

  // informative prior for beta parameter
  matrix[2, I] prior_beta;

}

parameters {

  ... other stuff ...

  // unscaled difficulty parameter
  vector[I] beta_c;

  // simplex coefficient
  simplex[K_b] coef_b;

}

transformed parameters {

  ... other stuff ...

  // rescale coefficients
  vector[I] beta = (W_b * coef_b) .* (beta_c .* prior_beta[2]' + prior_beta[1]');

}

model {

  ... still other stuff ...

  // prior on unscaled difficulty parameter
  beta_c ~ std_normal();

}

I’m not sure whether the simplex option is doing what I want it to in this case. My fear is that this approach (a) fails to rescale everything like I want it to and/or (b) makes the prior on the IRT parameter way more informative than I want it to be.

I do plan on adding in the random effects as well, so that should help to adjust the parameter estimate away from just reflecting the mean of the prior, though my hope is that the prior on beta_c would also add some wiggle around the prior if needed. I also don’t have a clear idea of how to extract the rescaled coefficients since the only recorded parameter is the simplex version.
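
If the scaling above does what I intend, the effective coefficient on covariate k for item i is coef_b[k] multiplied by the rescaled item term, so something like this in generated quantities might recover the rescaled coefficients (untested):

generated quantities {

  // implied per-item coefficients on the original difficulty scale
  matrix[I, K_b] coef_b_scaled;
  for ( i in 1 : I )
    coef_b_scaled[i] = coef_b' * (beta_c[i] * prior_beta[2, i] + prior_beta[1, i]);

}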

Surely there’s an easier way that I’m not thinking of, so open to any and all recommendations!

Just want to find an entry point to an answer here:
Is it reasonable to expect that all of the coefficients will be between 0 and 1? Why?

Shortly after posting the previous update, I started working on the problem from a different angle. In essence, the approach that I’m taking now is to compute the linear prediction component of the model and place a prior on this constructed value. It’s long, but here’s the full Stan code that I’m currently working with:

functions {
    
    // scale random variables that are correlated
    matrix scale_rand(matrix std, vector sigma, matrix omega) {
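        // std:   standard-normal draws, one row per effect, one column per unit
        // sigma: standard deviations of the effects
        // omega: Cholesky factor of the effects' correlation matrix
        // returns a (units x effects) matrix of correlated, scaled effects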
        return transpose(diag_pre_multiply(sigma, omega) * std);
    }
    
} // functions block

data {
    
    // long format item response theory inputs  
    int<lower=1> I;                              // number of items
    int<lower=1> J;                              // number of persons
    int<lower=1> N;                              // number of observations (I * J)
    array[N] int<lower=1, upper=I> ii;           // index of items
    array[N] int<lower=1, upper=J> jj;           // index of persons
    array[N] int<lower=0, upper=1> y;            // responses
    
    // latent linear trait model inputs
    int<lower=0> K_beta_b;                       // number of fixed predictors on difficulty
    int<lower=0> K_beta_a;                       // number of fixed predictors on discrimination
    matrix[I, K_beta_b] X_b;                     // covariate matrix for difficulty
    matrix[I, K_beta_a] X_a;                     // covariate matrix for discrimination
    
    // latent regression model inputs
    int<lower=0> K_beta_j;                       // number of fixed predictors on ability
    matrix[J, K_beta_j] X_j;                     // covariate matrix for ability
    
    // random item model inputs
    int<lower=0> K_rnd_b;                        // number of random covariates on difficulty
    int<lower=0> K_rnd_a;                        // number of random covariates on discrimination
    matrix[I, K_rnd_b] R_b;                      // covariate matrix for difficulty
    matrix[I, K_rnd_a] R_a;                      // covariate matrix for discrimination
    
    // random person inputs
    int<lower=1> K_rnd_j;                        // number of random covariates on ability (must include intercept)
    matrix[J, K_rnd_j] R_j;                      // covariate matrix for ability
    
    // item-by-person covariates
    int<lower=0> K_ibj_b;                        // number of item-by-person covariates on difficulty
    int<lower=0> K_ibj_a;                        // number of item-by-person covariates on discrimination
    matrix[N, K_ibj_b] X_b_j;                    // item-by-person covariate matrix for difficulty
    matrix[N, K_ibj_a] X_a_j;                    // item-by-person covariate matrix for discrimination
    
    // pervasive bias model inputs
    int<lower=0> K_bias_j;                       // number of explanatory person-variables
    int<lower=0> K_bias_b;                       // number of difficulty item covariates to evaluate
    int<lower=0> K_bias_a;                       // number of discrimination item covariates to evaluate
    array[K_bias_j] int bias_j;                  // column number(s) for person variable(s) (bias_j[1] *must* be the group variable)
    array[K_bias_b] int bias_b;                  // column number(s) for difficulty variable(s)
    array[K_bias_a] int bias_a;                  // column number(s) for discrimination variable(s)
    
    // informative priors
    matrix[4, I] prior_i;                        // item priors (rows = mu_beta, sd_beta, mu_logalpha, sd_logalpha; columns = items)
    vector[2] prior_j;                           // person priors (mu_theta, sd_theta)
    
} // data block

transformed data {
    
    // identifiers for random items effects by parameter
    array[(K_rnd_b + (K_bias_b > 0))] int id_rnd_b;
    array[(K_rnd_a + (K_bias_a > 0))] int id_rnd_a;
    
    // identifiers for random person effects by parameter
    array[K_rnd_j] int id_rnd_j;
    array[K_bias_b] int id_bias_b;
    array[K_bias_a] int id_bias_a;
    
    // combine person random effects into single matrix
    matrix[J, (K_rnd_j + K_bias_b + K_bias_a)] R_j_c;
    
    // create random item effect identifiers
    for ( k in 1 : (K_rnd_b + (K_bias_b > 0)) ) id_rnd_b[k] = k;
    for ( k in 1 : (K_rnd_a + (K_bias_a > 0)) ) id_rnd_a[k] = k + (K_rnd_b + (K_bias_b > 0));
    
    // create random person effect identifiers
    for ( k in 1 : K_rnd_j ) id_rnd_j[k] = k;
    for ( k in 1 : K_bias_b ) id_bias_b[k] = k + K_rnd_j;
    for ( k in 1 : K_bias_a ) id_bias_a[k] = k + K_rnd_j + K_bias_b;
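    
    // e.g., with K_rnd_j = 2, K_bias_b = 1, and K_bias_a = 1, the person random
    // effects are laid out as columns [1:2] = ability terms, [3] = bias effect
    // on difficulty, and [4] = bias effect on discrimination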
    
    // create person random effects matrix
    {
        matrix[J, (K_rnd_j + K_bias_b)] R_j_temp = append_col(R_j, rep_matrix(rep_vector(1.0, J), K_bias_b));
        R_j_c = append_col(R_j_temp, rep_matrix(rep_vector(1.0, J), K_bias_a));
    } // local environment
    
    // convenience count of random effects on persons
    int n_rnd_j = K_rnd_j + K_bias_b + K_bias_a;
    
    // convenience count of random effects on items
    int n_rnd_i = K_rnd_b + K_rnd_a + (K_bias_b > 0) + (K_bias_a > 0);
    
    // index for all covariates examined outside of pervasive bias
    array[(K_beta_b - K_bias_b)] int unbias_b;
    array[(K_beta_a - K_bias_a)] int unbias_a;
    
    {
        int b_b = 1;                             // initialize indicator on difficulty included in bias
        int b_a = 1;                             // initialize indicator on discrimination included in bias
        int f_b = 1;                             // initialize indicator on difficulty excluded from bias
        int f_a = 1;                             // initialize indicator on discrimination excluded from bias
        
        if ( K_bias_b > 0 ) {
            for ( k in 1 : K_beta_b ) {
                
                // the b_b bound keeps bias_b from being indexed out of range
                // once every biased covariate has been matched
                if ( b_b <= K_bias_b && k == bias_b[b_b] ) {
                    b_b += 1;                    // increment indicator for next biased variable
                } else {
                    unbias_b[f_b] = k;
                    f_b += 1;                    // increment indicator for next non-biased variable
                } // if statement
                
            } // for loop
        } else {
            
            for ( k in 1 : K_beta_b ) unbias_b[k] = k;
        } // if statement
        
        if ( K_bias_a > 0 ) {
            
            for ( k in 1 : K_beta_a ) {
                
                if ( b_a <= K_bias_a && k == bias_a[b_a] ) {
                    b_a += 1;                    // increment indicator for next biased variable
                } else {
                    unbias_a[f_a] = k;
                    f_a += 1;                    // increment indicator for next non-biased variable
                } // if statement
                
            } // for loop
        } else {
            
            for ( k in 1 : K_beta_a ) unbias_a[k] = k;
        } // if statement
        
    } // local environment
    
} // transformed data block

parameters {
    
    // latent linear trait parameters
    vector[(K_beta_b - K_bias_b)] beta_b_raw;    // coefficients for difficulty, non-centered parameterization
    vector[(K_beta_a - K_bias_a)] beta_a_raw;    // coefficients for discrimination, non-centered parameterization
    
    // latent regression parameters
    vector[K_beta_j] beta_j_raw;                 // coefficients for ability, non-centered parameterization
    
    // random item parameters
    cholesky_factor_corr[n_rnd_i] omega_i;       // correlation of random item covariates
    matrix[n_rnd_i, I] z_par_i;                  // standardized random item covariates
    vector<lower=0>[n_rnd_i] s_par_i;            // standard deviation of random item covariates
    
    // random person parameters
    cholesky_factor_corr[n_rnd_j] omega_j;       // correlation of random person covariates
    matrix[n_rnd_j, J] z_par_j;                  // standardized random person covariates
    vector<lower=0>[n_rnd_j] s_par_j;            // standard deviation of random person covariates
    
    // item-by-person parameters
    vector[K_ibj_b] beta_b_j_raw;                // coefficients for difficulty-person, non-centered parameterization
    vector[K_ibj_a] beta_a_j_raw;                // coefficients for discrimination-person, non-centered parameterization
    
    // pervasive bias parameters
    array[K_bias_b] vector[K_bias_j] beta_bias_b_j_raw; // person coefficients for each of item covariate on difficulty subject to bias
    array[K_bias_a] vector[K_bias_j] beta_bias_a_j_raw; // person coefficients for each of item covariate on discrimination subject to bias
    
    
} // parameters block

transformed parameters {
    
    // IRT parameters
    vector[N] beta;
    vector[N] alpha;
    vector[N] theta;
    
    // extract random item parameters
    matrix[I, (K_rnd_b + (K_bias_b > 0))] rand_b;
    matrix[I, (K_rnd_a + (K_bias_a > 0))] rand_a;
    
    {
        matrix[I, n_rnd_i] int_i = scale_rand(z_par_i, s_par_i, omega_i);
        rand_b = int_i[, id_rnd_b];
        rand_a = int_i[, id_rnd_a];
    } // local environment
    
    
    // extract random person parameters
    matrix[J, n_rnd_j] rand_j = scale_rand(z_par_j, s_par_j, omega_j);
    
    vector[N] beta_raw = rep_vector(0.0, N);
    vector[N] logalpha_raw = rep_vector(0.0, N);
    vector[N] theta_raw = rep_vector(0.0, N);
    
    beta_raw += X_b[ii, unbias_b] * beta_b_raw + X_b_j * beta_b_j_raw;
    logalpha_raw += X_a[ii, unbias_a] * beta_a_raw + X_a_j * beta_a_j_raw;
    
    // contributions of the scaled random item effects from rand_b and rand_a
    if ( K_rnd_b > 0 ) beta_raw += rows_dot_product(R_b[ii, ], rand_b[ii, 1 : K_rnd_b]);
    if ( K_rnd_a > 0 ) logalpha_raw += rows_dot_product(R_a[ii, ], rand_a[ii, 1 : K_rnd_a]);
    
    for ( k in 1 : K_bias_b ) {
        beta_raw += X_b[ii, bias_b[k]] .* (X_j[jj, bias_j] * beta_bias_b_j_raw[k, : ] + 
        R_j_c[jj, id_bias_b[k]] .* rand_j[jj, id_bias_b[k]]);
    }
    
    for ( k in 1 : K_bias_a ) {
        logalpha_raw += X_a[ii, bias_a[k]] .* (X_j[jj, bias_j] * beta_bias_a_j_raw[k, : ] +
        R_j_c[jj, id_bias_a[k]] .* rand_j[jj, id_bias_a[k]]);
    }
    
    // latent regression fixed effects plus random person effects (intercept, etc.)
    theta_raw += X_j[jj, ] * beta_j_raw + rows_dot_product(R_j[jj, ], rand_j[jj, id_rnd_j]);
    
    // rescale IRT parameters
    for ( n in 1 : N ) {
        beta[n]  = beta_raw[n] * prior_i[2, ii[n]] + prior_i[1, ii[n]];
        alpha[n] = exp(logalpha_raw[n] * prior_i[4, ii[n]] + prior_i[3, ii[n]]);
    }
    theta = theta_raw * prior_j[2] + prior_j[1];
    
} // transformed parameters block

model {
    
    // non-centered IRT parameters
    target += std_normal_lupdf(beta_raw);
    target += std_normal_lupdf(logalpha_raw);
    target += std_normal_lupdf(theta_raw);
    
    // non-centered latent linear trait priors
    beta_b_raw ~ student_t(3, 0, 1);
    beta_a_raw ~ student_t(3, 0, 1);
    
    // non-centered latent regression priors
    beta_j_raw ~ student_t(3, 0, 1);
    
    // non-centered random item priors
    omega_i ~ lkj_corr_cholesky(1);
    to_vector(z_par_i) ~ std_normal();
    s_par_i ~ student_t(3, 0, 1);
    
    // non-centered random person priors
    omega_j ~ lkj_corr_cholesky(1);
    to_vector(z_par_j) ~ std_normal();
    s_par_j ~ student_t(3, 0, 1);
    
    // non-centered item-by-person priors
    beta_b_j_raw ~ student_t(3, 0, 1);
    beta_a_j_raw ~ student_t(3, 0, 1);
    
    // non-centered pervasive bias parameters
    for ( k in 1 : K_bias_b ) beta_bias_b_j_raw[k, : ] ~ student_t(3, 0, 1);
    for ( k in 1 : K_bias_a ) beta_bias_a_j_raw[k, : ] ~ student_t(3, 0, 1);
    
    // likelihood
    {
        vector[N] eta = alpha .* theta + beta;
        y ~ bernoulli_logit(eta);
    }
    
} // model block

I haven’t robustly tested the model yet, but it seems to be generally doing what I want. The optimize and ADVI estimation methods run reliably without issues. Sampling is, of course, much slower and thus harder to test. Running on a sample of I = 40 and N = 2075 (about the sample size I need it to work on), it took about 20 hours for the standard 1000/1000 warmup/sampling iterations. The only overt warning from the model was that every post-warmup iteration hit the maximum tree depth, and some quick eyeballing of the output suggests nothing deeply pathological. I also only tested the case where K_bias_j, K_bias_a, and K_bias_b were all zero, which is not the desired end use case; before trying that addition, I need to add to the data block a way of passing informative priors for the coefficients of the item covariates that get decomposed by those other variables.

The model that I’m shooting for should be something like this (apologies if the notation isn’t exactly right here):

y_{ijg} \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(\eta_{ijg}))
\eta_{ijg} = \alpha_{ijg}(\theta_{jg} - \beta_{ijg})
\alpha_{ijg} = \alpha_0 + \alpha_{0i} + \gamma_{i}^{\alpha}Z_g + \sum_{p=1}^{P}\left(\delta_{jp}^{\alpha} + \omega_{gp}^{\alpha}Z_{g} + \sum_{k=1}^{K}\omega_{pk}^{\alpha}Z_{jk}\right)X_{ip}
\beta_{ijg} = \beta_0 + \beta_{0i} + \gamma_{i}^{\beta}Z_g + \sum_{p=1}^{P}\left(\delta_{jp}^{\beta} + \omega_{gp}^{\beta}Z_{g} + \sum_{k=1}^{K}\omega_{pk}^{\beta}Z_{jk}\right)X_{ip}
\theta_{jg} = \theta_{0j} + \omega_{g}^{\theta}Z_{g} + \sum_{k=1}^{K}\omega_{k}^{\theta}Z_{jk}

Where i indexes the item number, j the person, and g the group (a dichotomous variable in this case). Variables affecting the item parameters are represented via X while those for person covariates are shown as Z, and there are a total of P item covariates and K person covariates of interest.

I’d love to make the model as efficient as possible, since there are some additional things that I’ll need to add to it (e.g., I need a missing data mechanism for some partially observed person covariates). If it is relevant to those recommendations, I’m using a 2021 MacBook Pro with an M1 Max processor (10 cores). I’d also greatly appreciate any suggestions for handling the missing data on the covariates – it’s mostly continuous variables that are missing, but some are ordinal and a few more are binary.
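
For the continuous covariates, my current plan follows the standard Stan missing-data pattern of treating the missing entries as parameters. A minimal sketch with placeholder names (z here is a generic person covariate, not the group variable from earlier):

data {
  int<lower=0> N_obs;                        // persons with the covariate observed
  int<lower=0> N_mis;                        // persons with the covariate missing
  array[N_obs] int<lower=1> idx_obs;         // person indices of observed values
  array[N_mis] int<lower=1> idx_mis;         // person indices of missing values
  vector[N_obs] z_obs;                       // observed covariate values
}

parameters {
  vector[N_mis] z_mis;                       // missing values treated as parameters
  real mu_z;                                 // covariate mean
  real<lower=0> sigma_z;                     // covariate scale
}

transformed parameters {
  vector[N_obs + N_mis] z;                   // completed covariate vector
  z[idx_obs] = z_obs;
  z[idx_mis] = z_mis;
}

model {
  z ~ normal(mu_z, sigma_z);                 // imputation model; z then enters X_j as usual
}

The ordinal and binary covariates are messier, since Stan has no discrete parameters and the missing values would need to be marginalized out.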

I’m also pretty far out from my comfort zone when it comes to writing models in Stan (or really in any kind of non-lme4 style notation for that matter). So, if someone spots an issue or mistake somewhere, then I’d greatly appreciate the sanity check!