This blog post talks about the insensitivity-to-sample-size bias, using the example of the smallest and largest counties having the highest cancer ‘rates’ (# of cancer cases / county population).
I simulated this data with Python, with the number of cancer cases drawn from binomial(Population, 0.02):
import pandas as pd, numpy as np, scipy.stats as sts, altair.vegalite.v3 as A

nz = A.Scale(zero=False)  # altair scale used for the plot below
n_counties = 500

df = (
    pd.DataFrame(
        # roughly log-normally distributed county populations
        dict(county_pop=(np.exp(np.random.randn(n_counties) * 3 + 10)).astype(int))
    )
    .query("county_pop < .5e6 & county_pop > 1e4")
    # number of cancer cases drawn as binomial(population, 0.02)
    .assign(n_cancer_cases=lambda x: sts.binom.rvs(x.county_pop, 0.02))
)
df["cancer_rates"] = df.eval("n_cancer_cases / county_pop")
and it ends up looking like this:

As in the blog post, the smaller counties show the largest variation in the rates of cancer, but the mean rate is constant. Are there techniques to model this variation? In the past I’ve simply done things like have \sigma vary inversely with the population size.
Here’s an example program:
data {
  int<lower=0> N;
  vector[N] inv_sqrt_pop;  // 1 / sqrt(county population)
  vector[N] rate;          // observed cancer rate per county
  real scale;              // prior scale, passed in as data
}
parameters {
  real sig_alpha;
  real<lower=0> sig_beta;
  real<lower=0> sig_sig;
  vector<lower=0>[N] sigma;        // per-county sd of the observed rate
  real<lower=0, upper=1> rate_mu;  // shared mean rate
}
model {
  // expected sd grows as the population shrinks
  vector[N] sig_mu = sig_alpha + sig_beta * inv_sqrt_pop;
  sig_alpha ~ normal(0, scale);
  sig_beta ~ normal(0, scale);
  sig_sig ~ normal(0, scale);
  sigma ~ normal(sig_mu, sig_sig);
  rate ~ normal(rate_mu, sigma);
}
Is there any distribution or technique that would get closer to modeling the underlying data generating process? A downside of my current approach is that it seems to depend heavily on the number of observations I have. Also, I’m not sure it sufficiently captures the relationship between small county size and the variation in the rate.
Thank you
Chris
Welcome back wcbeard,
In your post you’ve been talking about a binomial distribution, but your Stan code shows only normal distributions. I’m not sure why that is?
You may use a beta distribution:
Y ~ beta(rate * scale, (1.0 - rate) * scale)
The parameter rate is your proportion, and scale defines how narrow or spread out your data is.
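For instance, a minimal Stan sketch of that parameterization could look like this (names are illustrative, not from your model; note that a rate of exactly 0 or 1 lies outside the beta support):

data {
  int<lower=0> N;
  vector<lower=0, upper=1>[N] Y;   // observed proportions
}
parameters {
  real<lower=0, upper=1> rate;     // mean proportion
  real<lower=0> scale;             // concentration: larger = narrower spread
}
model {
  rate ~ beta(2, 2);               // example priors only
  scale ~ gamma(2, 0.1);
  Y ~ beta(rate * scale, (1 - rate) * scale);
}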
If you want Y to be continuous, then you may choose a student_t distribution and let the parameter nu vary.
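A minimal sketch of that version, with nu estimated from the data (illustrative only; the gamma(2, 0.1) prior on nu is just a common weakly informative choice):

data {
  int<lower=0> N;
  vector[N] Y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=1> nu;               // degrees of freedom, estimated
}
model {
  nu ~ gamma(2, 0.1);
  sigma ~ normal(0, 1);
  Y ~ student_t(nu, mu, sigma);   // heavier tails when nu is small
}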
Hi andre, thanks for the response!
In your post you’ve been talking about a binomial distribution, but your Stan code shows only normal distributions.
Sorry, I probably wasn’t clear enough. I just wanted to use the binomial distribution to generate the data, but my interest is more in the situation where:
- I don’t know the exact process that generated the data
- but I know that some observations will be associated with smaller sample sizes, and I should thus expect them to have more variance
What I have a hard time with is that, when estimating a parameter like the rate, I want observations that aggregate over larger populations to carry more weight, and the smaller ones to carry less.
As an example, since Los Angeles county has a very large population, I want the observed rate to count more than the tiny Alpine County observed rate. I also want the model to take into account that the Alpine rate could have a higher variance.
But I believe the way Stan works, as I naively code it, is that each data point/observation gets equal weight, despite the fact that LA county represents a significantly higher population.
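To make the weighting idea concrete, here is a stripped-down sketch of the kind of thing I mean (illustrative only, not my actual model): each observation gets a standard deviation proportional to 1 / sqrt(population), so the likelihood contributions of big counties are much tighter and pull the shared rate_mu harder than those of small counties.

data {
  int<lower=0> N;
  vector[N] inv_sqrt_pop;  // 1 / sqrt(county population)
  vector[N] rate;          // observed cancer rate per county
}
parameters {
  real<lower=0, upper=1> rate_mu;
  real<lower=0> sigma0;
}
model {
  sigma0 ~ normal(0, 1);
  // big county -> small sd -> that county's rate constrains rate_mu more
  rate ~ normal(rate_mu, sigma0 * inv_sqrt_pop);
}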
If you want Y to be continuous, then you may choose a student_t distribution and let the parameter nu vary.
Thank you for that pointer. Does student_t offer an advantage over the normal in this case, or is it simply another option to try?
Thank you very much!
Chris
Is your data a probability or just a real number?
In the case of a real number, the student_t with \nu = n - 1 degrees of freedom represents n observations, and thus for small \nu reflects a larger variance. See Wikipedia.
In the case of a probability measure you could go with a beta distribution, or with a logit/probit approach and use the student_t approach.
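Taken literally, the \nu = n - 1 idea would mean fixing each observation’s degrees of freedom from its own sample size, for example (illustrative sketch only; n_obs is a hypothetical variable):

data {
  int<lower=0> N;
  array[N] int<lower=2> n_obs;          // observations behind each rate
  vector[N] rate;
}
transformed data {
  vector[N] nu = to_vector(n_obs) - 1;  // nu = n - 1 per observation
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  sigma ~ normal(0, 1);
  rate ~ student_t(nu, mu, sigma);      // small n -> heavier tails
}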
Thank you, I’m interested in the general solution where the rate is a real number, and from reading that description it looks like student_t would be good for modeling my case.
For the probability case, when you say
a logit/probit approach and use the student_t approach
do you mean something like
x ~ student_t(...);
y ~ bernoulli_logit(x);
Thanks again
I meant something like this.
Take a look also at the beta distribution; it’s worth knowing about for this kind of model.
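For the logit/probit version, a fuller sketch of what I meant could look like this (names are illustrative only):

data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;   // binary outcomes
}
parameters {
  real x;                             // latent log-odds
  real<lower=1> nu;
}
model {
  nu ~ gamma(2, 0.1);                 // example prior only
  x ~ student_t(nu, 0, 2.5);
  y ~ bernoulli_logit(x);             // probability = inv_logit(x)
}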