# Decomposing Regression Coefficient in Non-Linear Model

I’ve got a model that I’m trying to figure out how to fit in brms if possible. The general idea comes from a specific use case of a linear logistic test model from item response theory. To walk through this quickly, I’m going to attempt to write out the formulas (apologies if these don’t abide by appropriate standard notations, but hopefully they are at least accurate).

The linear logistic test model is usually treated as an extension of the one-parameter logistic regression, which is shown on the logit scale as:

\eta_{ij} = \theta_j + \beta_i

The log-odds, \eta, of person j answering item i correctly is a function of the person’s ability, \theta_j, and the item easiness, \beta_i. The linear logistic test model arises when the item easiness parameter is estimated from covariates such that,

\beta_i = \Sigma_{1}^{k}\omega_kX_k+\delta_i

Here, the item easiness parameter is just a function of various item covariates, X_k, that each have their own coefficients, \omega_k, and then each item can have its own residual estimated by a random intercept, \delta_i

My understanding is that the linear logistic test model would be specified in brms using the following formula:

llt1pm <- brm(resp ~ cov1 + cov2 + cov3 + (1 | item) + (1 | id),
data = df, family = bernoulli(link = "logit"))


Extending this to the two-parameter logistic model is straightforward with brms as well:

llt2pm <- brm(bf(resp ~ exp(logalpha)*theta + beta, nl = TRUE,
theta ~ 0 + (1 | id),
logalpha ~ 1 + cov1 + cov2 + cov3 + (1 | item),
beta ~ 1 + cov1 + cov2 + cov3 + (1 | item)),
data = df, family = bernoulli(link = "logit"))


I’m interested in something akin to differential item functioning but at the level of the item covariates. So, the equation for a one-parameter model would be something like this:

\eta_{ijg} = \theta_j + \omega_gG+ \omega_zZ_j + \delta_i +\omega_1X_1 + \omega_2X_2 + (\delta_j + \omega_gG + \omega_zZ_j)X_3

In effect, the item covariates (X_k) still operate as normal while there is still a permitted random residual associated with each item, \delta_i, but the coefficient or effect for one of the item covariates (X_3 in this case) is permitted to vary over persons. Specifically, it is of interest the extent to which differences in the effect of the item covariate can be separated into an effect due to a person-covariate, Z_j, a group-indicator, G, and then a residual person-level term, \delta_j. Since the person-level covariates are included in the item covariate effect estimation, these same parameters are included as part of the base model to account for potential between-person differences in \theta that may be due to the person-covariate and group indicator. Also, the group indicator is not a dummy variable but is instead coded as -1 and 1 since the emphasis is on interpreting the effect in terms of the difference between two groups.

I cannot seem to find a way to specify this model in brms(). I’m more interested in the two-parameter extension of the formula as well. My first thought was that the specification should look something like this:

llt2pm <- brm(bf(resp ~ exp(logalpha)*theta + beta, nl = TRUE,
theta ~ z + g + (1 | id),
logalpha ~ 1 + cov1 + cov2 + ba*cov3 + (1 | item),
beta ~ 1 + cov1 + cov2 + bb*cov3 + (1 | item),
ba ~ z + g + (1 | id),
bb ~ z + g + (1 | id)),
data = df, family = bernoulli(link = "logit"))


This, of course, does not work because of the combined linear and non-linear components on the beta and logalpha parameters. I cannot, however, figure out how to properly specify the model so that brms will run the model.

Does anyone know how to specify this model in brms? If it’s not possible, then I’m also open to other thoughts about fitting the model.

There might be an easier way to do this in brms, but something that should work is to define a custom family that takes distributional parameters theta, logalpha, beta, ba, and bb, and vreal component cov3. Let’s rename theta as mu. Then you could write something like (this is untested and quite possibly buggy, but it might get you on the right track):

resp | vreal(cov3) ~ z + g + (1 | id), # this is the theta piece, but for brms we have to call something mu
logalpha ~ 1 + cov1 + cov2 + (1 | item), # this is the logalpha piece except for the ba term
beta ~ 1 + cov1 + cov2 + (1 | item), # the beta piece except for the bb term
ba ~ z + g + (1 | id),
bb ~ z + g + (1 | id))


In the definition of the custom family, we’d have something like

real p  = exp(logalpha[i] + ba[i] * vreal1[i]) * mu[i] + beta[i] + bb[i] * vreal1[i];
target += bernoulli_logit_lpmf(y[i] | p);


Realistically, it would probably be easier to just write this model in pure Stan rather than to wrangle this custom family. The reason to do this in brms would be if you want the ability to rapidly and flexibly experiment with alternative specification for the various linear predictors.

1 Like

I figured that something outside the standard brms formulas would be needed in this case. I’m always surprised by the flexibility of the package. I suspect that you’re right in the conclusion that coding directly in Stan is probably the simpler option. I’d wanted to use brms mostly because (a) I’m not super confident in writing multilevel non-linear regressions and (b) I’d like to use the reduce_sum() function to get multithreading going for the model, which brms does automagically

I know that there are quite a few examples of IRT models in Stan out there, so I’ll probably adapt code for that. Any recommendations about speed up options to pursue? I’ve seen some talk about the _glm functions and there being circumstances in which they may or may not be time saving. I’m sure I can get a clunky version of the model going, but it might help of potential directions to take the model for efficiency

I’ve transitioned from brms to cmdstanr and am just writing the model outright now. I’ve run into an issue with trying to pass along relevant prior information, though. So, I’m open to any recommendations for how to approach this

For illustrative purposes, here’s a simplified model that still gets the idea across:

data {
int<lower=1> I;                      // number of items
int<lower=1> J;                      // number of persons
int<lower=1> N;                      // number of observations
array[N] int<lower=1, upper=I> ii;   // indicator for item on observation N
array[N] int<lower=1, upper=J> jj;   // indicator for person on observation N
array[N] int<lower=0, upper=1> y;    // indicator for response on observation N
int<lower=1> K_b;                    // difficulty covariates
matrix[I, K_b] W_b;                  // difficulty covariate matrix
matrix[2, I] prior_logalpha;         // informative priors for log-scaled alpha parameter
}

parameters {
vector[J] theta;                     // latent ability parameters
vector[I] logalpha_c;                // centered log-scaled discrimination parameters
vector[K_b] coef_b;                  // coefficients for item difficulty parameters
}

transformed parameters {
// linear prediction of item difficulty
vector[I] beta = W_b * coef_b;

// rescale item discrimination parameters
vector[I] alpha = exp(logalpha_c * prior_logalpha[2, ] + prior_logalpha[1, ]);
}

model {
// some priors
coef_b ~ std_normal();
theta ~ std_normal();
logalpha_c ~ std_normal();

// likelihood
{
vector[N] eta = alpha[ii] .* theta[jj] - beta[ii];
y ~ bernoulli_logit(eta);
}
}


The problem that I’m running into, and which is hopefully illustrated in the above (untested) code, is that I have some prior information about the parameter values that should be included in the model. These priors are normal approximations using the mean and standard deviation of posteriors from a standard IRT model that just estimates the parameters outright, no explanatory covariates or anything.

The way I’m currently thinking about dealing with this is to specify the coefficients in the linear prediction of the parameters as a simplex so that the coefficients are essentially proportions of parameter of interest. [EDIT: After posting it occurred to me that the simplex is strictly positive, so this will not work as intended. Maybe declaring a unit vector instead?] I think this could then be scaled something like the follow:

data {

... other stuff ...

// informative prior for beta parameter
matrix[2, I] prior_beta;

}

parameters {

... other stuff ...

// unscaled difficulty parameter
vector[I] beta_c;

// simplex coefficient
simplex[K_b] coef_b;

}

transformed parameters {

... other stuff ...

// rescale coefficients
vector[I] beta = (coef_b * W_b) * (beta_c * prior_beta[2, ] + prior_beta[1, ]);

}

model {

... still other stuff ...

// prior on unscaled difficulty parameter
beta_c ~ normal_std();

}


I’m not sure whether the simplex option is doing what I want it to in this case. My fear is that this approach (a) fails to rescale everything like I want it to and/or (b) makes the prior on the IRT parameter way more informative than I want it to be.

I do plan on adding in the random effects as well, so that should help to adjust the parameter estimate away from just reflecting the mean of the prior, though my hope is that the prior on beta_c would also add some wiggle around the prior if needed. I also don’t have a clear idea of how to extract the rescaled coefficients since the only recorded parameter is the simplex version.

Surely there’s an easier way that I’m not thinking of, so open to any and all recommendations!

Just want to find an entry point to an answer here:
Is it reasonable to expect that all of the coefficients will be between 0 and 1? Why?

Shortly after posting the previous update, I started working on the problem from a different angle. In essence, the approach that I’m taking now is to compute the linear prediction component of the model and placing a prior on this constructed value. It’s long, but here’s the full Stan code that I’m currently working with:

functions {

// scale random variables that are correlated
matrix scale_rand(matrix std, vector sigma, matrix omega) {
return transpose(diag_pre_multiply(sigma, omega) * std);
}

} // functions block

data {

// long format item response theory inputs
int<lower=1> I;                              // number of items
int<lower=1> J;                              // number of persons
int<lower=1> N;                              // number of observations (I * J)
array[N] int<lower=1, upper=I> ii;           // index of items
array[N] int<lower=1, upper=J> jj;           // index of persons
array[N] int<lower=0, upper=1> y;            // responses

// latent linear trait model inputs
int<lower=0> K_beta_b;                       // number of fixed predictors on difficulty
int<lower=0> K_beta_a;                       // number of fixed predictors on discrimination
matrix[I, K_beta_b] X_b;                     // covariate matrix for difficulty
matrix[I, K_beta_a] X_a;                     // covariate matrix for discrimination

// latent regression model inputs
int<lower=0> K_beta_j;                       // number of fixed predictors on ability
matrix[J, K_beta_j] X_j;                     // covariate matrix for ability

// random item model inputs
int<lower=0> K_rnd_b;                        // number of random covariates on difficulty
int<lower=0> K_rnd_a;                        // number of random covariates on discrimination
matrix[I, K_rnd_b] R_b;                      // covariate matrix for difficulty
matrix[I, K_rnd_a] R_a;                      // covariate matrix for difficulty

// random person inputs
int<lower=1> K_rnd_j;                        // number of random covariates on ability (must include intercept)
matrix[J, K_rnd_j] R_j;                      // covariate matrix for ability

// item-by-person covariates
int<lower=0> K_ibj_b;                        // number of item-by-person covariates on difficulty
int<lower=0> K_ibj_a;                        // number of item-by-person covariates on discrimination
matrix[N, K_ibj_b] X_b_j;                    // item-by-person covariate matrix for difficulty
matrix[N, K_ibj_a] X_a_j;                    // item-by-person covariate matrix for discrimination

// pervasive bias model inputs
int<lower=0> K_bias_j;                       // number of explanatory person-variables
int<lower=0> K_bias_b;                       // number of difficulty item covariates to evaluate
int<lower=0> K_bias_a;                       // number of discrimination item covariates to evaluate
array[K_bias_j] int bias_j;                  // column number(s) for person variable(s) (bias_j *must* be the group variable)
array[K_bias_b] int bias_b;                  // column number(s) for difficulty variable(s)
array[K_bias_a] int bias_a;                  // column number(s) for discrimination variable(s)

// informative priors
matrix[4, I] prior_i;                        // item priors (rows = mu_beta, sd_beta, mu_logalpha, sd_logalpha; columns = items)
vector prior_j;                           // person priors (mu_theta, sd_theta)

} // data block

transformed data {

// identifiers for random items effects by parameter
array[(K_rnd_b + (K_bias_b > 0))] int id_rnd_b;
array[(K_rnd_a + (K_bias_a > 0))] int id_rnd_a;

// identifiers for random person effects by parameter
array[K_rnd_j] int id_rnd_j;
array[K_bias_b] int id_bias_b;
array[K_bias_a] int id_bias_a;

// combine person random effects into single matrix
matrix[J, (K_rnd_j + K_bias_b + K_bias_a)] R_j_c;

// create random item effect identifiers
for ( k in 1 : (K_rnd_b + (K_bias_b > 0)) ) id_rnd_b[k] = k;
for ( k in 1 : (K_rnd_a + (K_bias_a > 0)) ) id_rnd_a[k] = k + (K_rnd_b + (K_bias_b > 0));

// create random person effect identifiers
for ( k in 1 : K_rnd_j ) id_rnd_j[k] = k;
for ( k in 1 : K_bias_b ) id_bias_b[k] = k + K_rnd_j;
for ( k in 1 : K_bias_a ) id_bias_a[k] = k + K_rnd_j + K_bias_b;

// create person random effects matrix
{
matrix[J, (K_rnd_j + K_bias_b)] R_j_temp = append_col(R_j, rep_matrix(rep_vector(1.0, J), K_bias_b));
R_j_c = append_col(R_j_temp, rep_matrix(rep_vector(1.0, J), K_bias_a));
} // local environment

// convenience count of random effects on items
int n_rnd_j = K_rnd_j + K_bias_b + K_bias_a;

// convenience count of random effects on items
int n_rnd_i = K_rnd_b + K_rnd_a + (K_bias_b > 0) + (K_bias_a > 0);

// index for all covariates examined outside of pervasive bias
array[(K_beta_b - K_bias_b)] int unbias_b;
array[(K_beta_a - K_bias_a)] int unbias_a;

{
int b_b = 1;                             // initialize indicator on difficulty included in bias
int b_a = 1;                             // initialize indicator on discrimination included in bias
int f_b = 1;                             // initialize indicator on difficulty excluded from bias
int f_a = 1;                             // initialize indicator on discrimination excluded from bias

if ( K_bias_b > 0 ) {
for ( k in 1 : K_beta_b ) {

if ( k == bias_b[b_b] ) {
b_b += 1;                        // increment indicator for next biased variable
} else {
unbias_b[f_b] = k;
f_b += 1;                    // increment indicator for next non-biased variable
} // if statement

} // for loop
} else {

for ( k in 1 : K_beta_b ) unbias_b[k] = k;
} // if statement

if ( K_bias_a > 0 ) {

for ( k in 1 : K_beta_a ) {

if ( k == bias_a[b_a] ) {
b_a += 1;                        // increment indicator for next biased variable
} else {
unbias_a[f_a] = k;
f_a += 1;                    // increment indicator for next non-biased variable
} // if statement

} // for loop
} else {

for ( k in 1 : K_beta_a ) unbias_a[k] = k;
} // if statement

} // local environment

} // transformed data block

parameters {

// latent linear trait parameters
vector[(K_beta_b - K_bias_b)] beta_b_raw;    // coefficients for difficulty, non-centered parameterization
vector[(K_beta_a - K_bias_a)] beta_a_raw;    // coefficients for discrimination, non-centered parameterization

// latent regression parameters
vector[K_beta_j] beta_j_raw;                 // coefficients for ability, non-centered parameterization

// random item parameters
cholesky_factor_corr[n_rnd_i] omega_i;       // correlation of random item covariates
matrix[n_rnd_i, I] z_par_i;                  // standardized random item covariates
vector<lower=0>[n_rnd_i] s_par_i;            // standard deviation of random item covariates

// random person parameters
cholesky_factor_corr[n_rnd_j] omega_j;       // correlation of random person covariates
matrix[n_rnd_j, J] z_par_j;                  // standardized random person covariates
vector<lower=0>[n_rnd_j] s_par_j;            // standard deviation of random person covariates

// item-by-person parameters
vector[K_ibj_b] beta_b_j_raw;                // coefficients for difficulty-person, non-centered parameterization
vector[K_ibj_a] beta_a_j_raw;                // coefficients for discrimination-person, non-centered parameterization

// pervasive bias parameters
array[K_bias_b] vector[K_bias_j] beta_bias_b_j_raw; // person coefficients for each of item covariate on difficulty subject to bias
array[K_bias_a] vector[K_bias_j] beta_bias_a_j_raw; // person coefficients for each of item covariate on discrimination subject to bias

} // parameters block

transformed parameters {

// IRT parameters
vector[N] beta;
vector[N] alpha;
vector[N] theta;

// extract random item parameters
matrix[I, (K_rnd_b + (K_bias_b > 0))] rand_b;
matrix[I, (K_rnd_a + (K_bias_a > 0))] rand_a;

{
matrix[I, n_rnd_i] int_i = scale_rand(z_par_i, s_par_i, omega_i);
rand_b = int_i[, id_rnd_b];
rand_b = int_i[, id_rnd_a];
} // local environment

// extract random person parameters
matrix[J, n_rnd_j] rand_j = scale_rand(z_par_j, s_par_j, omega_j);

vector[N] beta_raw = rep_vector(0.0, N);
vector[N] logalpha_raw = rep_vector(0.0, N);
vector[N] theta_raw = rep_vector(0.0, N);

beta_raw += X_b[ii, unbias_b] * beta_b_raw + X_b_j * beta_b_j_raw;
logalpha_raw += X_a[ii, unbias_a] * beta_a_raw + X_a_j * beta_a_j_raw;

for ( k in 1 : K_bias_b ) {
beta_raw += X_b[ii, bias_b[k]] .* (X_j[jj, bias_j] * beta_bias_b_j_raw[k, : ] +
R_j_c[jj, id_bias_b[k]] .* rand_j[jj, id_bias_b[k]]);
}

for ( k in 1 : K_bias_a ) {
logalpha_raw += X_a[ii, bias_a[k]] .* (X_j[jj, bias_j] * beta_bias_a_j_raw[k, : ] +
R_j_c[jj, id_bias_a[k]] .* rand_j[jj, id_bias_a[k]]);
}

theta_raw += X_j[jj, ] * beta_j_raw;

// rescale IRT parameters
for ( n in 1 : N ) {
beta[n]  = beta_raw[n] * prior_i[2, ii[n]] + prior_i[1, ii[n]];
alpha[n] = exp(logalpha_raw[n] * prior_i[4, ii[n]] + prior_i[3, ii[n]]);
}
theta = theta_raw * prior_j + prior_j;

} // transformed parameters block

model {

// non-centered IRT parameters
target += std_normal_lupdf(beta_raw);
target += std_normal_lupdf(logalpha_raw);
target += std_normal_lupdf(theta_raw);

// non-centered latent linear trait priors
beta_b_raw ~ student_t(3, 0, 1);
beta_a_raw ~ student_t(3, 0, 1);

// non-centered latent regression priors
beta_j_raw ~ student_t(3, 0, 1);

// non-centered random item priors
omega_i ~ lkj_corr_cholesky(1);
to_vector(z_par_i) ~ std_normal();
s_par_i ~ student_t(3, 0, 1);

// non-centered random person priors
omega_j ~ lkj_corr_cholesky(1);
to_vector(z_par_j) ~ std_normal();
s_par_j ~ student_t(3, 0, 1);

// non-centered item-by-person priors
beta_b_j_raw ~ student_t(3, 0, 1);
beta_a_j_raw ~ student_t(3, 0, 1);

// non-centered pervasive bias parameters
for ( k in 1 : K_bias_b ) beta_bias_b_j_raw[k, : ] ~ student_t(3, 0, 1);
for ( k in 1 : K_bias_a ) beta_bias_a_j_raw[k, : ] ~ student_t(3, 0, 1);

// likelihood
{
vector[N] eta = alpha .* theta + beta;
y ~ bernoulli_logit(eta);
}

} // model block


I haven’t robustly tested the model yet, but it seems to be generally doing what I want. The optimize and ADVI estimation methods seem to run reliably without issues. Sampling is, of course, much slower and thus harder to test. Running on a sample of I = 40 and N = 2075 (about the sample size I need it to work on), it took about 20 hours for the standard 1000/1000 warmup/iterations. The only overt warning/error from the model was that every post-warmup iteration hit the maximum tree depth, and some quick eyeballing of the outcome seems to suggest no deeply pathologic outcomes. I also only tested it in the case where K_bias_j, K_bias_a, and K_bias_b were all zero, which is not the desired end use case (I need to add in the data block a way of passing informative priors for the coefficients of the item covariates that are decomposed by including these other variables before trying that addition out).

The model that I’m shooting for should be something like this (apologies if the notation isn’t exactly right here):

y_{ijg} \sim Bernoulli(logit^{-1}(\eta_{ijg}))
\eta_{ijg} = \alpha_{ig}(\theta_{jg} - \beta_{ig})
\alpha_{ijg} = \alpha_0 + \alpha_{0i} + \gamma_{i}^{\alpha}Z_g + \sum_{p=1}^{P}\sum_{k=1}^{K}(\delta_{jp}^{\alpha} + \omega_{gp}^{\alpha}Z_{g} + \omega_{jpk}^{\alpha}Z_j)X_{ip}
\beta_{ijg} = \beta_0 + \beta_{0i} + \gamma_{i}^{\beta}Z_g + \sum_{p=1}^{P}\sum_{k=1}^{K}(\delta_{jp}^{\beta} + \omega_{gp}^{\beta}Z_{g} + \omega_{jpk}^{\beta}Z_j)X_{ip}
\theta_{jg} = \theta_{0j} + \omega_{g}^{\theta}Z_{g} + \sum_{k=1}^{K}(\omega_{k}^{\theta}Z_{jk})

Where i indexes the item number, j the person, and g the group (a dichotomous variable in this case). Variables affecting the item parameters are represented via X while those for person covariates are shown as Z, and there are a total of P item covariates and K person covariates of interest.

I’d love to make the model as efficient as possible since there are some additional things that I’ll need to add to it (e.g,. I need a missing data mechanism for some partially observed person covariates). If it is relevant to those recommendations, I’m using a 2021 Mac Book Pro with an M1 Max processor (10 cores). I’d also greatly appreciate any suggestions for handling the missing data on the covariates – it’s mostly continuous variables that are missing but some are ordinal and then a few more are binary.

I’m also pretty far out from my comfort zone when it comes to writing models in Stan (or really in any kind of non-lme4 style notation for that matter). So, if someone spots an issue or mistake somewhere, then I’d greatly appreciate the sanity check!