Modeling variation when aggregated sample size is small

This blog post talks about the insensitivity-to-sample-size bias, using an example where the smallest and largest counties have the highest cancer ‘rates’ (# cancer incidences / county population).
I simulated this data with Python (with the # of cancer cases coming from binomial(Population, .02)):

import pandas as pd, numpy as np, scipy.stats as sts, altair.vegalite.v3 as A
nz = A.Scale(zero=False)

n_counties = 500
df = (
    pd.DataFrame(dict(county_pop=(np.exp(np.random.randn(n_counties) * 3 + 10)).astype(int)))
    .query("county_pop < .5e6 & county_pop > 1e4")
    .assign(n_cancer_cases=lambda x: sts.binom.rvs(x.county_pop, 0.02))
)
df["cancer_rates"] = df.eval("n_cancer_cases / county_pop")

and it ends up looking like this:


As in the blog post, the smaller counties show the largest variation in the rates of cancer, but the mean rate is constant. Are there techniques to model this variation? In the past I’ve simply done things like have \sigma vary inversely with the population size.

Here’s an example program:

data {
  int<lower=0> N;
  vector[N] inv_sqrt_pop;
  vector[N] rate;
  real scale;
}
parameters {
  real sig_alpha;
  real<lower=0> sig_beta;
  real<lower=0> sig_sig;
  vector<lower=0>[N] sigma;
  real<lower=0, upper=1> rate_mu;
}
model {
  vector[N] sig_mu = sig_alpha + sig_beta * inv_sqrt_pop;
  sig_alpha ~ normal(0, scale);
  sig_beta ~ normal(0, scale);
  sig_sig ~ normal(0, scale);
  sigma ~ normal(sig_mu, sig_sig);
  rate ~ normal(rate_mu, sigma);
}
Is there any distribution or technique that would get closer to modeling the underlying data generating process? A downside of this is that it seems to depend heavily on the number of observations I have. Also, I’m not sure my current method sufficiently captures the relationship between small county size and the variation in the rate.
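For reference, the binomial process I used to generate the data implies that the sampling standard deviation of an observed rate is sqrt(p * (1 - p) / n), which makes the size/variance relationship explicit. A quick sketch (the population values here are just illustrative):

```python
import numpy as np

# Under cases ~ binomial(n, p), the observed rate cases/n has
# standard deviation sqrt(p * (1 - p) / n), so smaller counties
# scatter more widely around the shared mean rate.
p = 0.02
pops = np.array([10_000, 100_000, 500_000])
sd_rate = np.sqrt(p * (1 - p) / pops)
# sd_rate shrinks as population grows
```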

Thank you

Welcome back wcbeard,

in your post, you’ve been talking about a binomial distribution, but your Stan code shows
only normal distributions. I’m not sure why that is?

You may use a beta distribution,

Y ~ beta(rate * scale, (1.0 - rate) * scale)

The parameter rate is your proportion, and scale defines how narrow or spread out your data is.
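To see why this parameterization is convenient, here is a quick scipy check (the rate and scale values are arbitrary): the mean stays at rate regardless of scale, while a larger scale gives a tighter distribution.

```python
import scipy.stats as sts

rate, scale = 0.02, 500.0  # arbitrary illustrative values
tight = sts.beta(rate * scale, (1.0 - rate) * scale)
loose = sts.beta(rate * 50.0, (1.0 - rate) * 50.0)

# The mean of beta(a, b) is a / (a + b), which equals rate for any scale.
mean_tight = tight.mean()
# A smaller scale gives a wider distribution around the same mean.
spread_ratio = loose.std() / tight.std()
```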

If you want Y to be continuous, then you may choose a student_t distribution and let the
parameter nu vary.

Hi andre, thanks for the response!

in your post, you’ve been talking about a binomial distribution, but your Stan code shows
only normal distributions

Sorry, I probably wasn’t clear enough. I just wanted to use the binomial distribution to generate the data, but my interest is more where

  • I don’t know the exact process that generated the data
  • but I know that some observations will be associated with smaller sample sizes, and I should thus expect them to have more variance

What I have a hard time with is that when estimating a parameter like the rate, I want the observations that aggregate larger numbers to have more weight, and the smaller ones to have less weight.

As an example, since Los Angeles county has a very large population, I want the observed rate to count more than the tiny Alpine County observed rate. I also want the model to take into account that the Alpine rate could have a higher variance.

But I believe the way Stan works, as I’ve naively coded it, is that each data point/observation gets equal weight, despite the fact that LA county represents a significantly larger population.
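To make the weighting concern concrete (with made-up populations): pooling raw counts under a shared binomial rate weights each county by its population, while averaging the per-county rates, which is roughly what my normal-on-rates model does, weights every county equally.

```python
import numpy as np

rng = np.random.default_rng(0)
pops = np.array([10_000_000, 1_000])   # made-up LA-sized vs. tiny county
cases = rng.binomial(pops, 0.02)

# Under a shared binomial rate, the MLE pools the raw counts,
# so the large county dominates the estimate:
rate_pooled = cases.sum() / pops.sum()

# Averaging per-county rates instead gives each county equal weight:
rate_naive = (cases / pops).mean()
```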

If you want Y to be continuous, then you may choose a student_t distribution and let the
parameter nu vary

Thank you for that pointer. Does student_t offer an advantage over the normal in this case, or is it simply another option to try?

Thank you very much!

Is your data a probability or just a real number?

In the case of a real number, the student_t with \nu = n - 1 degrees of freedom represents n observations, and thus a small \nu reflects a larger variance. See Wikipedia.
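A quick scipy check of that intuition: the variance of a student_t is \nu / (\nu - 2) for \nu > 2, so fewer effective observations (small \nu) means a wider distribution.

```python
import scipy.stats as sts

# var of student_t is nu / (nu - 2) for nu > 2, so a small nu
# (few effective observations) gives a much larger variance
var_small_nu = sts.t(3).var()    # 3 / (3 - 2) = 3.0
var_large_nu = sts.t(30).var()   # 30 / 28, close to 1
```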

In the case of a probability you could go with a beta distribution, or with a logit/probit transform combined with the student_t approach.

Thank you, I’m interested in the general solution where rate is a real number, and from reading that description it looks like student_t would be a good fit for modeling my case.

For the probability case, when you say

a logit/probit approach and use the student_t approach

do you mean something like

x ~ student_t(...);
y ~ bernoulli_logit(x);

Thanks again

I meant something like this.
Also take a look at the beta distribution; it’s worth knowing about and provides a good example:
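A small numpy/scipy sketch of that logit-plus-student_t idea (all parameter values here are made up): a heavy-tailed latent variable on the logit scale, squashed back to (0, 1), driving Bernoulli outcomes.

```python
import numpy as np
from scipy import stats, special

rng = np.random.default_rng(1)
nu = 3.0                                  # made-up degrees of freedom
# latent values on the logit scale, centered at logit(0.02)
x = stats.t.rvs(nu, loc=special.logit(0.02), scale=0.5,
                size=1000, random_state=rng)
p = special.expit(x)                      # back to probabilities in (0, 1)
y = rng.binomial(1, p)                    # bernoulli_logit analogue
```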