I want to use a count variable (the number of dependent clauses in an essay) as a predictor in a regression model. Since essay length varies across writers, the count should be normalised (e.g., into the number of dependent clauses per sentence). However, the length also affects the reliability of each data point: 1 dependent clause observed in 3 sentences is less reliable evidence about a writer's rate than 10 dependent clauses observed in 30 sentences. I therefore thought the model should receive both the raw count and the denominator used for normalisation (e.g., the number of sentences), so that it can reflect this difference in reliability.
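For example, treating the counts as Poisson, both of those essays yield the same estimated rate (about 0.33 clauses per sentence), but the standard error of that rate, sqrt(count) / denominator, differs by a factor of about three:

sqrt(1) / 3    # SE ~0.33 for 1 clause in 3 sentences
sqrt(10) / 30  # SE ~0.11 for 10 clauses in 30 sentences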
One idea that I thought might work is a measurement-error model in which the observed count is generated from a Poisson (or negative binomial) distribution whose rate is the true, latent normalised rate multiplied by the observed denominator. I can then estimate the true normalised rate from the counts and use it as the predictor in the regression.
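Concretely, writing $\lambda_n$ for the latent true rate (dependent clauses per sentence) of writer $n$, the model I have in mind is

$$\mathit{ndepcl}_n \sim \mathrm{Poisson}(\lambda_n \cdot \mathit{nsents}_n), \qquad \mathit{prof}_n \sim \mathrm{Normal}(\mathrm{intercept} + \mathrm{slope} \cdot \lambda_n,\ \sigma),$$

which is what the Stan program below implements.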
I thought this was fairly straightforward, but a simple simulation shows that the parameters of the regression model are not recovered accurately. I am wondering what I am doing wrong in the code below.
Stan code:
data {
  int<lower=1> N;           // number of data points
  int<lower=0> ndepcl[N];   // number of dependent clauses (target count variable)
  int<lower=1> nsents[N];   // number of sentences (offset/exposure)
  vector[N] prof;           // proficiency (outcome variable)
}
parameters {
  real intercept;
  real slope;
  real<lower=0> sigma;
  real<lower=0> depsent_true[N]; // unknown true rate of dependent clauses per sentence
}
model {
  for (n in 1:N) {
    // measurement model: observed count given the latent per-sentence rate
    ndepcl[n] ~ poisson(depsent_true[n] * nsents[n]);
    // structural model: outcome given the latent rate
    prof[n] ~ normal(intercept + slope * depsent_true[n], sigma);
  }
}
R code:
library(rstan)

# create fake data
d <- data.frame(
  # true per-sentence rate increases gradually from 0 to 1.5
  depsent.true = seq(0, 1.5, length.out = 50),
  # number of sentences per essay
  nsents = 10
)
# the outcome is linearly related to the true rate, with an intercept of 1 and a slope of 33
set.seed(1)
d$prof <- 1 + 33 * d$depsent.true + rnorm(50, sd = 2)

# the observed count is generated from a Poisson distribution,
# each sentence contributing one draw with the writer's true rate
set.seed(1)
for (i in 1:nrow(d)) {
  d$ndepcl[i] <- sum(rpois(d$nsents[i], d$depsent.true[i]))
}
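# (aside: since a sum of independent Poisson draws is itself Poisson,
# the loop above is equivalent, up to the random seed, to:
# d$ndepcl <- rpois(nrow(d), d$nsents * d$depsent.true))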
# fit the model
fit <- stan(
  file = 'ratio.as.predictor.stan',
  data = list(
    N = nrow(d),
    ndepcl = d$ndepcl,
    nsents = d$nsents,
    prof = d$prof
  ),
  iter = 5000,
  seed = 1
)
# see the summary of the results
round(summary(fit, pars = c("intercept", "slope", "sigma"))$summary, 2)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# intercept 5.16 0.07 2.01 1.70 3.73 5.00 6.43 9.51 768.22 1.01
# slope 25.44 0.10 2.67 20.10 23.71 25.45 27.23 30.55 715.16 1.01
# sigma 4.46 0.06 1.41 1.94 3.45 4.40 5.38 7.36 474.85 1.01
As can be seen above, the true values of the intercept and the slope (1 and 33, respectively) fall outside their corresponding 95% credible intervals. The deviation gets even larger when I specify a smaller denominator (e.g., nsents = 1). What am I missing here, and how could I fix the issue?
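For reference, the naive plug-in alternative I would otherwise fall back on (regressing on the observed ratio and ignoring the difference in reliability) would be:

# naive alternative: regress the outcome directly on the observed ratio
summary(lm(prof ~ I(ndepcl / nsents), data = d))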