 About how to estimate the finite population standard deviation!

Hi all!
I have a question on how to calculate the finite population standard deviation. In the chapter 21 and 22 of the book“Data analysis using regression and multilevel/hierarchical models”.the finite population SD was calculated simply by
sd(α_j), where α_j is the estimated value for each level of a factor. Therefore, if a factor just have two level, the finite population SD is sd(α_1,α_2). But if I want to evaluated the contribution of this factor to the variation of the response variable Y ( as shown in Figure 22.4 and 22.5), should we put α_j back to the original data set to calculate the finite population SD?
For example if the data set have 10 rows, and a factor X only have two levels, the first level have 6 rows, and the second level have 4 row, then should we calculate the finite population SD as sd(α_1,α_1,α_1,α_1,α_1,α_1,α_2,α_2, α_2,α_2), or just by sd(α_1,α_2)? I think sd(α_1,α_2) might be inappropriate, because the sd_error was calculated by 10 value :
ey[i]=(y[i]-y_ha[i]), and sd_error=sd(ey[i]),
Therefore, put α_j back to the original dataset might by more appropriate for calculating the finite population SD and the contribution of this factor to variation of the response variable. Do you think so?
I raise this question because I see on a web site Tutorial 8.2b - Comparing two populations (Bayesian), the finite population SD was calculated as sd(α_1,α_1,α_1,α_1,α_1,α_1,α_2, α_2, α_2, α_2), it is apparent that the two methods will get different results.
In the attachment, I uploaded an example of the two methods with stan code and data, using the example of Figure 22.4 in your book.

see the difference of the two methods in the following example model

data {
int<lower=0> J;
int<lower=0> N;
int<lower=1,upper=J> county[N];
vector[N] y;
}
parameters {
vector[J] a;
real mu_a;
real<lower=0,upper=100> sigma_a;
real<lower=0,upper=100> sigma_y;
}
transformed parameters {
vector[N] y_hat;
for (i in 1:N){
y_hat[i] = a[county[i]];
}
}
model {
mu_a ~ normal(0, 1);

a ~ normal(10 * mu_a, sigma_a);
y ~ normal(y_hat, sigma_y);
}
generated quantities {
real<lower=0> s_a;
real<lower=0> s_a_2;
real<lower=0> s_y;
vector[N] e_y;
vector[N] est_county;

for (i in 1:N){
est_county[i] = a[county[i]];
}

e_y = y - y_hat;
s_a = sd(a);// the sd of 85 values
s_a_2=sd(est_county);// the sd of 919 vlaues
s_y = sd(e_y);// the sd of 919 vlaues
1 Like

Hi,
if I understand your question correctly, than the different approaches actually correspond to different questions and you really need to understand what is more relevant to your specific use case.

In your code, s_a is what would AFAIK correspond to the “finite population” variance as discussed in the blog. This is sensible if you don’t ever expect to extrapolate to new counties and want to know how much you would expect two random counties to differ.

It might also be interesting to inspect the “infinite population” variance, i.e. look at sigma_a directly (with 85 counties it is likely the difference between s_a and sigma_a will be small). This says how much you would expect two counties (which might include counties you have not observed yet) to differ.

s_a_2 tells you have much variance in your specific dataset can be attributable to county. In this sense it puts more weight on counties that are represented more frequently in your data. In the extreme, if one county correspond to almost all observations and other counties would have just one or two observations, it would become close to zero.

All of those are reasonable quantities that would be suited for different underlying questions.

Does that make sense?

Thanks very much! You solved my problem and I am agree with you! Best wishes!