Normalised count variable as a predictor in a regression model

I want to use a count variable (the number of dependent clauses in an essay) as a predictor in a regression model. Since essay length varies across writers, the count should be normalised (e.g., into the number of dependent clauses per sentence). But the length also influences the reliability of each data point (e.g., 1 dependent clause observed in 3 sentences is less reliable than 10 dependent clauses observed in 30 sentences), so I thought I should build a model that receives both the count variable and the denominator used for normalisation (e.g., the number of sentences), allowing the model to reflect this difference in reliability.

One idea that I thought might work is to build something like a measurement-error model in which the observed count is generated from a Poisson (or negative binomial) distribution whose rate is the true, latent normalised count multiplied by the observed denominator. I can then estimate the true normalised count from this measurement model and include it as a predictor in the regression model.

I thought the above would be fairly straightforward, but a simple simulation shows that the parameter values of the regression model are not recovered accurately. I am wondering what I am doing wrong in the code below.

Stan code:

data {
  int<lower=1> N;              // number of data points
  int<lower=0> ndepcl[N];      // number of dependent clauses (target count variable)
  int<lower=1> nsents[N];      // number of sentences (offset/exposure)
  vector[N] prof;              // proficiency (outcome variable)
}
parameters {
  real intercept;                // intercept of the structural model
  real slope;                    // slope of the structural model
  real<lower=0> sigma;           // residual SD of the structural model
  real<lower=0> depsent_true[N]; // unknown true value of the number of dependent clauses per sentence
}
model {
  for (n in 1:N) {
    // measurement model
    ndepcl[n] ~ poisson(depsent_true[n] * nsents[n]);
    // structural model
    prof[n] ~ normal(intercept + slope * depsent_true[n], sigma);
  }
}

R code:

# create fake data
d <- data.frame(
  # true value increases gradually from 0 to 1.5
  depsent.true = seq(0, 1.5, length = 50), 
  # number of sentences
  nsents = 10
)
# the outcome is linearly related to the true value, with an intercept of 1 and a slope of 33
set.seed(1)
d$prof <- 1 + 33 * d$depsent.true + rnorm(50, sd = 2)
# the observed count variable is generated from a Poisson distribution
set.seed(1)
for (i in 1:nrow(d)) {
  d$ndepcl[i] <- sum(rpois(d$nsents[i], d$depsent.true[i]))
}

# fit the model
library(rstan)
fit <- stan(file = 'ratio.as.predictor.stan',
            data = list(
              N = nrow(d),
              ndepcl = d$ndepcl,
              nsents = d$nsents,
              prof = d$prof
            ),
            iter = 5000,
            seed = 1
)

# see the summary of the results
round(summary(fit, pars = c("intercept", "slope", "sigma"))$summary, 2)
#            mean se_mean   sd  2.5%   25%   50%   75% 97.5%  n_eff Rhat
# intercept  5.16    0.07 2.01  1.70  3.73  5.00  6.43  9.51 768.22 1.01
# slope     25.44    0.10 2.67 20.10 23.71 25.45 27.23 30.55 715.16 1.01
# sigma      4.46    0.06 1.41  1.94  3.45  4.40  5.38  7.36 474.85 1.01

As can be seen above, the true values of the intercept and the slope (1 and 33, respectively) fall outside their corresponding 95% CIs. The deviation gets even larger when I specify a smaller denominator (e.g., nsents = 1). What am I missing here, and how could I fix the issue?

Usually when you convert a binomial to a Poisson, you want to use a proportion here, declared with <lower=0, upper=1>. You are also simulating with values between 0 and 1.5 inclusive. The values of 0 are probably not going to work well, as they are degenerate at the boundary of the Poisson parameter space.

Also, I have no idea how that data.frame call works with one vector and one scalar as arguments; I'm not very good with R. But you can use Stan to generate data if you have priors, and then, fitting with those same priors, the model should be well calibrated in expectation (i.e., over multiple runs).

I'm also not sure why you're summing over the rpois output. Don't you want to multiply those arguments and then just do rpois(1, d$nsents[i] * d$depsent.true[i])? That matches what you have in the model, I think.
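If I have that right, the whole simulation loop can be replaced by a single vectorised call, since a sum of independent Poisson draws has the same distribution as one Poisson draw with the summed rate (a sketch, assuming the same data frame d as above):

# equivalent in distribution: the sum of nsents iid Poisson(rate) draws
# is one Poisson(nsents * rate) draw
d$ndepcl <- rpois(nrow(d), d$nsents * d$depsent.true)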

Also, you can make the Stan code a lot simpler by removing the loop in the model block and using vectorization:

// note: this requires depsent_true to be declared as vector<lower=0>[N] rather than an array
ndepcl ~ poisson(depsent_true .* to_vector(nsents));
prof ~ normal(intercept + slope * depsent_true, sigma);

I would also suggest adding reasonable weakly informative priors based on the scale of the expected values for the parameters.
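Putting those two suggestions together, the parameters and model blocks might look something like this (just a sketch; the prior scales here are illustrative assumptions that you should adapt to the scales you actually expect):

parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
  vector<lower=0>[N] depsent_true; // vector type enables the elementwise .* below
}
model {
  // weakly informative priors; the scales are illustrative assumptions
  intercept ~ normal(0, 10);
  slope ~ normal(0, 20);
  sigma ~ normal(0, 5);
  depsent_true ~ exponential(1);

  // measurement model
  ndepcl ~ poisson(depsent_true .* to_vector(nsents));
  // structural model
  prof ~ normal(intercept + slope * depsent_true, sigma);
}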

Thank you for your reply.

depsent_true is not a binomial proportion, as it can in theory be larger than one (i.e., a sentence can include multiple dependent clauses, as in “If X happens, I would do Y because . . .”), though that is rare in practice.

In the observed data, there are a number of cases where ndepcl is 0. If a Poisson distribution is unlikely to model the variable well, what would you suggest I use instead? A zero-inflated Poisson distribution? Could this be the cause of the mismatch between the true and estimated parameter values?
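To make the question concrete, I believe the zero-inflated version would replace the measurement model inside the loop with something like the following (only a sketch, adapted from the zero-inflation example in the Stan manual; theta would be a new parameter, declared as real<lower=0, upper=1> theta, for the probability of a structural zero):

// zero-inflated measurement model, replacing ndepcl[n] ~ poisson(...)
if (ndepcl[n] == 0)
  target += log_sum_exp(bernoulli_lpmf(1 | theta),
                        bernoulli_lpmf(0 | theta)
                          + poisson_lpmf(0 | depsent_true[n] * nsents[n]));
else
  target += bernoulli_lpmf(0 | theta)
            + poisson_lpmf(ndepcl[n] | depsent_true[n] * nsents[n]);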

In R, data.frame recycles the scalar (10) to match the length of the vector, so it is repeated 50 times. I created the data in R only because I'm more familiar with it, but thank you for pointing out the possibility of doing the same in Stan; I hadn't thought of it due to my limited competence with Stan.

I agree. Thank you for these suggestions.