Ordered Logistic - Small Sample Bias

I am working up to a stratified sampling model for ordered categorical data. I expect some small sample sizes. When I run a simple model on one sample with n small, the mean probabilities are well off the sample probabilities. When I run a larger sample, the means are close. Should I be worried? The models are below.

Stan Model for Ordered Categorical - Small n

options(digits=3)

# data

c1 = rep(1, 1)
c2 = rep(2, 2)
c3 = rep(3, 3)
c4 = rep(4, 4)
scores = c(c1, c2, c3, c4)
N = length(scores)

library(rstan)

rs=list(N=N,scores=scores)

# build stan model

stanstratrubric_ord = "
data {
int<lower=0> N; // number of observations
int<lower=1,upper=4> scores[N]; // list of scores
}
parameters {
ordered[3] cutpoints; // 3 cutpoints for 4 scores, keeps in order
}
model {
cutpoints ~ normal(0,10); // our prior for the cutpoints, on the logit scale (e.g. -3 -> p ~ .05, 0 -> p = .50)
for (i in 1:N) {
scores[i] ~ ordered_logistic(0, cutpoints);
}
}
generated quantities {
vector[3] cumprob; // cumulative probabilities
vector[4] prob; // probabilities
cumprob = inv_logit(cutpoints);
prob[1] = cumprob[1];
prob[2] = cumprob[2] - cumprob[1];
prob[3] = cumprob[3] - cumprob[2];
prob[4] = 1 - cumprob[3];
}
"

fit <- stan(model_code = stanstratrubric_ord, data = rs, iter = 4000, warmup = 2000, chains = 4, control = list(adapt_delta = 0.99), refresh = 0)
get_posterior_mean(fit, pars = c("prob[1]", "prob[2]", "prob[3]", "prob[4]"))
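
For reference, the sample proportions that the posterior means are being compared against can be computed directly (with the data above these are 0.1, 0.2, 0.3, 0.4):

# empirical category proportions for comparison with the posterior means
table(factor(scores, levels = 1:4)) / N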

Stan Model for Ordered Categorical - Large n

options(digits=3)

# data

c1 = rep(1, 10)
c2 = rep(2, 20)
c3 = rep(3, 30)
c4 = rep(4, 40)
scores = c(c1, c2, c3, c4)
N = length(scores)

library(rstan)

rs=list(N=N,scores=scores)

# build stan model

stanstratrubric_ord = "
data {
int<lower=0> N; // number of observations
int<lower=1,upper=4> scores[N]; // list of scores
}
parameters {
ordered[3] cutpoints; // 3 cutpoints for 4 scores, keeps in order
}
model {
cutpoints ~ normal(0,10); // our prior for the cutpoints, on the logit scale (e.g. -3 -> p ~ .05, 0 -> p = .50)
for (i in 1:N) {
scores[i] ~ ordered_logistic(0, cutpoints);
}
}
generated quantities {
vector[3] cumprob; // cumulative probabilities
vector[4] prob; // probabilities
cumprob = inv_logit(cutpoints);
prob[1] = cumprob[1];
prob[2] = cumprob[2] - cumprob[1];
prob[3] = cumprob[3] - cumprob[2];
prob[4] = 1 - cumprob[3];
}
"

fit <- stan(model_code = stanstratrubric_ord, data = rs, iter = 4000, warmup = 2000, chains = 4, control = list(adapt_delta = 0.99), refresh = 0)
get_posterior_mean(fit, pars = c("prob[1]", "prob[2]", "prob[3]", "prob[4]"))
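
The sample proportions are the same as in the small-n case, only with ten times as many observations:

# empirical category proportions (0.1, 0.2, 0.3, 0.4 with N = 100)
table(factor(scores, levels = 1:4)) / N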

This is to be expected. If you print the full summary via print(fit), the output for the small-N case would be something like:

Inference for Stan model: 2c592c48b51f3916e2aa109e2a468f2c.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

               mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
cutpoints[1]  -3.05    0.03 1.31  -6.23  -3.81  -2.86  -2.14  -0.99  2353    1
cutpoints[2]  -1.01    0.01 0.72  -2.52  -1.47  -0.98  -0.52   0.32  5584    1
cutpoints[3]   0.53    0.01 0.68  -0.79   0.08   0.51   0.96   1.92  6939    1
cumprob[1]     0.08    0.00 0.07   0.00   0.02   0.05   0.11   0.27  3294    1
cumprob[2]     0.29    0.00 0.13   0.07   0.19   0.27   0.37   0.58  6033    1
cumprob[3]     0.62    0.00 0.14   0.31   0.52   0.63   0.72   0.87  6892    1
prob[1]        0.08    0.00 0.07   0.00   0.02   0.05   0.11   0.27  3294    1
prob[2]        0.21    0.00 0.12   0.04   0.13   0.19   0.28   0.49  5542    1
prob[3]        0.33    0.00 0.13   0.10   0.23   0.32   0.42   0.62  3665    1
prob[4]        0.38    0.00 0.14   0.13   0.28   0.37   0.48   0.69  6892    1
lp__         -13.56    0.03 1.25 -16.73 -14.14 -13.22 -12.66 -12.13  2398    1

Samples were drawn using NUTS(diag_e) at Wed Aug 28 17:23:00 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

While the means for prob[i] jump around the actual values, you may notice the large uncertainty due to the small sample size. That is, the results tell you that you shouldn't trust the mean estimates very much: for prob[1], for example, the whole range from 0 to 0.27 lies within the 95% credible interval. With larger N the uncertainty decreases and the means become more informative. I think the take-home message is that you should not report only the mean estimates; reporting the associated uncertainty (95% CI or at least the sd) is very important.
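
For example (a sketch using rstan's summary method; adjust pars and probs to taste), you can report the posterior means together with 95% intervals like this:

# posterior means, sds and 95% credible intervals for the category probabilities
summary(fit, pars = "prob", probs = c(0.025, 0.975))$summary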

Finally, two minor notes:

  • Unless you are aiming for something very special, you might want to use the brms package, which gives you a well-tested implementation of ordinal regression almost for free (see the sketch after this list).
  • You can use triple backticks (```) to format code blocks in your post neatly.
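
For illustration, a minimal brms version of the intercept-only model above could look roughly like this (an untested sketch; brms chooses its own default priors, so the results will not match the hand-written Stan model exactly):

library(brms)

# intercept-only cumulative (ordinal) logit model, analogous to the Stan model above
fit_brms <- brm(scores ~ 1,
                data = data.frame(scores = scores),
                family = cumulative("logit"))
summary(fit_brms)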

Hope that helps.

Thanks, Martin, for your reply.
As I understand it, the “biased” means are caused by the underlying sampler bumping against the end bounds (0 and 1), though it still seems that the means should center around the sample values.

I will look into using the backticks.

As far as using brms goes, I prefer a process of knowledge accretion; in this case, as you can see, I am trying to build increasingly complex models.
I really appreciate you taking the time to reply.

Oh, I see, I misunderstood you earlier. So the bias you care about is that when you fit the model repeatedly, the prob[i] are always skewed in the same direction? This is IMHO still unsurprising, but for a different reason: the sampler is expected to give very similar results when run on the same data. It would only jump around the true values if the data were repeatedly simulated from exactly the model you've encoded.

But you have one fixed dataset and this dataset is not very typical of your prior. If I simulate the cutpoints according to your prior, I get (please double check the code):

N_sims <- 1000

cutpoints <- array(NA_real_, c(N_sims, 3))
probs <- array(NA_real_, c(N_sims, 4))

for(i in 1:N_sims) {
	cutpoints[i,] <- sort(rnorm(3, 0, 10)) #Enforce the ordering
	cumprob <- 1/(1 + exp(-cutpoints[i,]))
	probs[i,] <- diff(c(0, cumprob, 1))
}

# Plot implied univariate prior distributions
hist(probs[,1])
hist(probs[,2])
hist(probs[,3])
hist(probs[,4])

# Prior probability probs[1] < 0.08
mean(probs[,1] < 0.08)

The implied prior for probs[1] looks like this:
[histogram of probs[,1] omitted; the bulk of the prior mass sits close to 0]

And the prior probability that probs[1] < 0.08 is around 0.8. The prior actually strongly favors small probabilities for the extreme categories. Since you have a small dataset, the prior has noticeable influence on your inferences (but the wide uncertainty signals that you cannot have a lot of trust in the precise estimates anyway).
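
If you want more than the histograms, the simulated prior draws can also be summarised numerically, e.g.:

# numeric summaries of the implied prior on the category probabilities
colMeans(probs)
quantile(probs[, 1], probs = c(0.05, 0.25, 0.5, 0.75, 0.95))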

Indeed, when I change the prior, the results change. E.g. when I put cutpoints ~ normal(0, 2); the results change to something like:

               mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
cutpoints[1]  -2.15    0.02 0.86  -3.99  -2.72  -2.09  -1.55  -0.63  3033    1
cutpoints[2]  -0.71    0.01 0.61  -1.94  -1.11  -0.70  -0.29   0.45  6094    1
cutpoints[3]   0.63    0.01 0.61  -0.53   0.21   0.61   1.03   1.87  7196    1
cumprob[1]     0.13    0.00 0.09   0.02   0.06   0.11   0.17   0.35  3164    1
cumprob[2]     0.34    0.00 0.13   0.13   0.25   0.33   0.43   0.61  6333    1
cumprob[3]     0.64    0.00 0.13   0.37   0.55   0.65   0.74   0.87  7142    1
prob[1]        0.13    0.00 0.09   0.02   0.06   0.11   0.17   0.35  3164    1
prob[2]        0.21    0.00 0.11   0.05   0.13   0.20   0.28   0.46  5054    1
prob[3]        0.30    0.00 0.12   0.10   0.21   0.29   0.38   0.56  4043    1
prob[4]        0.36    0.00 0.13   0.13   0.26   0.35   0.45   0.63  7142    1
lp__         -14.46    0.03 1.27 -17.55 -15.06 -14.14 -13.53 -13.01  2484    1

This shows the influence of the prior choice.
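
To see how much the prior matters here, you can rerun the prior simulation above with the tighter prior; only the sd changes. The prior mass below 0.08 should come out around 0.3 instead of ~0.8, though the exact number will vary between simulation runs:

# same prior simulation as above, but with normal(0, 2) cutpoints
for(i in 1:N_sims) {
	cutpoints[i,] <- sort(rnorm(3, 0, 2)) # Enforce the ordering
	cumprob <- 1/(1 + exp(-cutpoints[i,]))
	probs[i,] <- diff(c(0, cumprob, 1))
}
mean(probs[,1] < 0.08)  # prior probability that probs[1] < 0.08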

It is actually quite hard to come up with a prior for the cutpoints that is not very tight AND gives large prior probability to a wide spread of the ordinal responses. Mike Betancourt has an interesting attempt in his case study on the topic (Ordinal Regression), but I have no direct experience with it.
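
For the intercept-only model in this thread there is also a simple workaround (my own sketch, not the construction from the case study, so please verify before relying on it): put the prior on the category probabilities directly and derive the cutpoints from them, e.g. with a uniform Dirichlet:

data {
int<lower=0> N; // number of observations
int<lower=1,upper=4> scores[N]; // list of scores
}
parameters {
simplex[4] prob; // category probabilities
}
transformed parameters {
// cutpoints implied by the probabilities; strictly increasing by construction
vector[3] cutpoints = logit(cumulative_sum(head(prob, 3)));
}
model {
prob ~ dirichlet(rep_vector(1, 4)); // uniform over the four category probabilities
for (i in 1:N) {
scores[i] ~ ordered_logistic(0, cutpoints); // equivalent to categorical(prob) when the linear predictor is 0
}
}

Once a linear predictor is added this simple trick no longer gives a flat prior over the responses, which is, as I understand it, why the case study develops a more general construction.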

Hope I’ve made things clearer for you.


Hi Martin, thanks for the explanation. My waking thought this morning was that if a category has a non-zero count, then I know the probability for that category is non-zero; and if I have a good sense of the population size, I have a strict lower bound on that probability. I was wondering if there is a way to incorporate that into the model. As you have shown, small-sample results are sensitive to the prior. thx, jim

While I am by no means an expert on Bayesian computation, I am not so sure about this: the ordered logistic model already implies that all categories have non-zero (although possibly negligible) probability, regardless of the data you collect. Seeing at least one response in a category implies its probability is not totally negligible, but that's IMHO about it. The same goes for the “strict lower bound”: yes, the posterior probability that the category has a very low true probability will tend to be vanishingly small, but there is nothing “strict” about it. E.g. after collecting 100 responses, 10 of which are in category 1, it is still possible that the true probability of category 1 is < 0.001, although it is highly unlikely.
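
To put a number on "highly unlikely" in that hypothetical example (a quick sanity check, not part of the model above): the chance of seeing 10 or more responses out of 100 in a category whose true probability is 0.001 is

# probability of observing >= 10 out of 100 when the true category probability is 0.001
pbinom(9, size = 100, prob = 0.001, lower.tail = FALSE)

which is vanishingly small, so with any reasonable prior the posterior puts essentially no mass on such values.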

All in all, I would guess that the reasoning you care about is mostly (if not completely) handled by the computed posterior probability: everything that is implied by the model will be reflected there. Unless I am missing something, your comments do not seem to introduce any new knowledge beyond some informal reasoning about the model.
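
In practice this means you can read the "lower bound" reasoning straight off the posterior draws, e.g. (using the prob parameter from the model above):

# posterior probability that category 1 has probability below some threshold
draws <- rstan::extract(fit, pars = "prob")$prob  # matrix of draws x 4 categories
mean(draws[, 1] < 0.001)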

Hope that helps!