This blog post discusses the insensitivity to sample size bias, using an example in which the smallest and largest counties have the highest cancer ‘rates’ (# cancer incidences / county population).

I simulated this data with Python (with the number of cancer cases coming from `binomial(population, .02)`):

```python
import pandas as pd, numpy as np, scipy.stats as sts, altair.vegalite.v3 as A

nz = A.Scale(zero=False)  # axis scale for the altair chart
n_counties = 500
df = (
    pd.DataFrame(
        dict(county_pop=(np.exp(np.random.randn(n_counties) * 3 + 10)).astype(int))
    )
    .query("county_pop < .5e6 & county_pop > 1e4")
    .assign(n_cancer_cases=lambda x: sts.binom.rvs(x.county_pop, 0.02))
)
df["cancer_rates"] = df.eval("n_cancer_cases / county_pop")
```

and it ends up looking like this:

As in the blog post, the smaller counties show the largest variation in the *rates* of cancer, but the mean rate is constant. Are there techniques to model this variation? In the past I’ve simply done things like have \sigma vary inversely with the population size.
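The pattern can also be checked numerically (a quick sketch that regenerates data the same way; the 0.02 rate and the population cut-offs match the simulation above):

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(0)
pop = np.exp(rng.standard_normal(5000) * 3 + 10).astype(int)
pop = pop[(pop > 1e4) & (pop < 5e5)]          # same cut-offs as the simulation
cases = sts.binom.rvs(pop, 0.02, random_state=rng)
rates = cases / pop

# Split counties at the median population and compare the spread of rates.
small = rates[pop < np.median(pop)]
large = rates[pop >= np.median(pop)]
print(small.std() > large.std())          # smaller counties vary more
print(abs(rates.mean() - 0.02) < 0.005)   # but the mean rate stays near 0.02
```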

Here’s an example program:

```stan
data {
  int<lower=0> N;
  vector[N] inv_sqrt_pop;
  vector[N] rate;
  real scale;
}
parameters {
  real sig_alpha;
  real<lower=0> sig_beta;
  real<lower=0> sig_sig;
  vector<lower=0>[N] sigma;
  real<lower=0, upper=1> rate_mu;
}
model {
  vector[N] sig_mu = sig_alpha + sig_beta * inv_sqrt_pop;
  sig_alpha ~ normal(0, scale);
  sig_beta ~ normal(0, scale);
  sig_sig ~ normal(0, scale);
  sigma ~ normal(sig_mu, sig_sig);
  rate ~ normal(rate_mu, sigma);
}
```

Is there any distribution or technique that would get closer to modeling the underlying data generating process? A downside of this is that it seems to depend heavily on the number of observations I have. Also, I’m not sure my current method sufficiently captures the relationship between small county size and the variation in the rate.

Thank you

Chris

Welcome back wcbeard,

In your post you’ve been talking about a binomial distribution, but your Stan code shows only normal distributions. I’m not sure why this is?

You may use a `beta` distribution:

`Y ~ beta(rate * scale, (1.0 - rate) * scale)`

The parameter `rate` is your proportion, and `scale` defines how narrow or spread out your data is.
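For illustration (a sketch with scipy; the rate of 0.02 and the `scale` values are arbitrary):

```python
import scipy.stats as sts

rate = 0.02
for scale in (10.0, 100.0, 1000.0):
    dist = sts.beta(rate * scale, (1.0 - rate) * scale)
    # The mean stays fixed at `rate`; the spread shrinks as `scale` grows.
    print(round(dist.mean(), 4), round(dist.std(), 4))
```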

If you want Y to be continuous, then you may choose a `student_t` distribution and let the parameter `nu` vary.

Hi andre, thanks for the response!

> in your post, you’ve been talking about a binomial distribution, but your Stan code shows only normal distributions

Sorry, I probably wasn’t clear enough. I just wanted to use the binomial distribution to generate the data, but my interest is more in situations where

- I don’t know the exact process that generated the data
- but I know that some observations will be associated with smaller sample sizes, and I should thus expect them to have more variance

What I have a hard time with is that when estimating a parameter like the rate, I want the observations that aggregate larger numbers to have more weight, and the smaller ones to have less weight.

As an example, since Los Angeles county has a very large population, I want the observed rate to count more than the tiny Alpine County observed rate. I also want the model to take into account that the Alpine rate could have a higher variance.

But I believe the way Stan works, as I naively code it, is that each data point/observation gets equal weight, despite the fact that LA county represents a significantly higher population.
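To illustrate with made-up numbers (both the populations and the case counts below are hypothetical): pooling the counts is equivalent to weighting each county’s rate by its population, which differs sharply from averaging the rates with equal weight:

```python
import numpy as np

pop = np.array([10_000_000, 1_100])   # hypothetical LA-sized vs. Alpine-sized counties
cases = np.array([200_000, 44])       # hypothetical case counts (rates 0.02 and 0.04)
rates = cases / pop

unweighted = rates.mean()                   # naive: each county counts equally
weighted = np.average(rates, weights=pop)   # pooled: equals cases.sum() / pop.sum()
print(unweighted, weighted)  # 0.03 vs. ~0.02: the tiny county dominates the naive mean
```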

> If you want to choose Y continuous then you may choose a student_t distribution and let parameter `nu` vary

Thank you for that pointer. Does `student_t` offer an advantage over the normal in this case, or is it simply another option to try?

Thank you very much!

Chris

Is your data a `probability` or just a `real number`?

In the case of a real number, a `student_t` with \nu = n - 1 degrees of freedom represents n observations, and thus a small \nu reflects a larger variance. See Wikipedia.
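You can check this variance behaviour directly (a sketch with scipy; Var(t_\nu) = \nu / (\nu - 2) for \nu > 2):

```python
import scipy.stats as sts

# Fewer observations -> smaller nu -> heavier tails and larger variance.
for nu in (3, 10, 100):
    print(nu, sts.t.var(df=nu))  # variance = nu / (nu - 2)
```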

In the case of a probability measure, you could go with a `beta` distribution, or with a logit/probit approach combined with the `student_t` approach.

Thank you, I’m interested in the general solution of the rate being a `real`, and from reading that description it looks like `student_t` would be good for modeling my case.

For the probability case, when you say

> a logit/probit approach and use the `student_t` approach

do you mean something like

```stan
x ~ student_t(...);
y ~ bernoulli_logit(x);
```
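i.e., generatively something like this (a Python sketch of what I have in mind; `nu = 4` and the unit location/scale are made-up values):

```python
import numpy as np
import scipy.stats as sts
from scipy.special import expit

rng = np.random.default_rng(1)
nu = 4  # made-up degrees of freedom
x = sts.t.rvs(df=nu, size=1000, random_state=rng)  # latent values on the logit scale
p = expit(x)                                       # squashed into (0, 1)
y = rng.binomial(1, p)                             # Bernoulli outcomes
print(p.min() > 0, p.max() < 1)
```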

Thanks again

I meant something like this.

Take a look also at the beta distribution; it’s worth knowing about, and provides a good example: