Thank you @ldeschamps and @martinmodrak. I too prefer the beta distribution for the reasons you stated, but thought I should consider other alternatives.
My response variable is (roughly speaking) the percentage of species in common between pairs of points. So, each observation is the percentage of species shared between a point A and a point B. I am working with observations taken in 11 locations, which means that observations from the same location will be more similar to each other than observations from different locations. Aside from this, there is a second issue: observations calculated using the same point will not be independent. The percentage of species in common between points A and B will not be independent of the percentage of species in common between points A and C.
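Just to make that structure concrete (the point IDs below are made up, and in my real data frame df each row already has sor, the scaled predictors, basin_id and sample_id), the pairwise layout looks roughly like this:

# Toy illustration of the pairwise structure (hypothetical points and basins).
pts <- data.frame(point = 1:6, basin = rep(1:2, each = 3))

# All pairs of points within the same basin, one row per pair.
pair_list <- lapply(split(pts$point, pts$basin), function(p) t(combn(p, 2)))
pairs <- do.call(rbind, pair_list)

df_toy <- data.frame(
  sample_id = pairs[, 1],                              # first point of the pair
  point_b   = pairs[, 2],                              # second point of the pair
  basin_id  = pts$basin[match(pairs[, 1], pts$point)]  # location of the pair
)
# Rows sharing a point (e.g. the pairs 1-2 and 1-3) are not independent of each other.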
To deal with the first issue I fitted a model with different intercepts and slopes (per location). To deal with the second issue I added a varying intercept for observations calculated with the same point.
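Written out, the model I am trying to fit is (using the same names as in the code below):

sor_i ~ Beta(mu_i * phi, (1 - mu_i) * phi)
logit(mu_i) = a_samp[sample_id_i] + a_basin[basin_id_i] + bnet[basin_id_i] * net_dist_s_i + bstrah[basin_id_i] * strahler_s_i + bpp[basin_id_i] * pp_s_i + bdrain[basin_id_i] * da_diff_s_i + bflow[basin_id_i] * flow_distB_i

where the six basin-level parameters (a_basin and the five slopes) get a multivariate normal prior with an LKJ(2) prior on their correlation matrix, and a_samp ~ Normal(0, sigma_samp).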
Here’s the code that I used to fit the model with the rethinking package:
m3 <- ulam(
  alist(
    # likelihood: mean/precision parameterization of the beta
    sor ~ beta(alpha, beta),
    alpha <- mu * phi,
    beta <- (1 - mu) * phi,
    # linear model on the logit scale
    logit(mu) <- a_samp[sample_id] + a_basin[basin_id] + bnet[basin_id] * net_dist_s +
      bstrah[basin_id] * strahler_s + bpp[basin_id] * pp_s +
      bdrain[basin_id] * da_diff_s + bflow[basin_id] * flow_distB,
    # varying intercept and slopes per basin (location)
    c(a_basin, bnet, bstrah, bpp, bdrain, bflow)[basin_id] ~
      multi_normal(c(a, bn, bs, bp, bd, bf), Rho, sigma_basin),
    a ~ normal(0, 10),
    c(bn, bs, bp, bd, bf) ~ normal(0, 1),
    sigma_basin ~ exponential(2),
    Rho ~ lkj_corr(2),
    # varying intercept per sampling point
    a_samp[sample_id] ~ normal(0, sigma_samp),
    sigma_samp ~ exponential(2),
    phi ~ exponential(0.1)
  ),
  data = df, warmup = 2000, iter = 5000, chains = 2, cores = 4,
  control = list(adapt_delta = 0.99), log_lik = TRUE
)
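For what it's worth, here is a quick base-R sketch of what a ~ normal(0, 10) and phi ~ exponential(0.1) imply for sor before seeing the data (just a rough prior-predictive check on the intercept-only part of the model, ignoring the varying effects and slopes):

# Rough prior-predictive sketch for the overall intercept and precision.
set.seed(1)
n   <- 1e4
a   <- rnorm(n, 0, 10)     # prior on the overall intercept (log-odds scale)
phi <- rexp(n, 0.1)        # prior on the beta precision
mu  <- plogis(a)           # inverse logit
sor_prior <- rbeta(n, mu * phi, (1 - mu) * phi)
hist(sor_prior, breaks = 50)   # with normal(0, 10), a lot of prior mass ends up near 0 and 1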
The corresponding Stan code is:
data{
  vector[7846] sor;
  int flow_distB[7846];
  vector[7846] da_diff_s;
  vector[7846] pp_s;
  vector[7846] strahler_s;
  vector[7846] net_dist_s;
  int basin_id[7846];
  int sample_id[7846];
}
parameters{
  vector[11] bflow;
  vector[11] bdrain;
  vector[11] bpp;
  vector[11] bstrah;
  vector[11] bnet;
  vector[11] a_basin;
  real a;
  real bf;
  real bd;
  real bp;
  real bs;
  real bn;
  vector<lower=0>[6] sigma_basin;
  corr_matrix[6] Rho;
  vector[327] a_samp;
  real<lower=0> sigma_samp;
  real<lower=0> phi;
}
model{
  vector[7846] alpha;
  vector[7846] beta;
  vector[7846] mu;
  phi ~ exponential( 0.1 );
  sigma_samp ~ exponential( 2 );
  a_samp ~ normal( 0 , sigma_samp );
  Rho ~ lkj_corr( 2 );
  sigma_basin ~ exponential( 2 );
  bn ~ normal( 0 , 1 );
  bs ~ normal( 0 , 1 );
  bp ~ normal( 0 , 1 );
  bd ~ normal( 0 , 1 );
  bf ~ normal( 0 , 1 );
  a ~ normal( 0 , 10 );
  {
    vector[6] YY[11];
    vector[6] MU;
    MU = [ a , bn , bs , bp , bd , bf ]';
    for ( j in 1:11 ) YY[j] = [ a_basin[j] , bnet[j] , bstrah[j] , bpp[j] , bdrain[j] , bflow[j] ]';
    YY ~ multi_normal( MU , quad_form_diag(Rho , sigma_basin) );
  }
  for ( i in 1:7846 ) {
    mu[i] = a_samp[sample_id[i]] + a_basin[basin_id[i]] + bnet[basin_id[i]] * net_dist_s[i] + bstrah[basin_id[i]] * strahler_s[i] + bpp[basin_id[i]] * pp_s[i] + bdrain[basin_id[i]] * da_diff_s[i] + bflow[basin_id[i]] * flow_distB[i];
    mu[i] = inv_logit(mu[i]);
  }
  for ( i in 1:7846 ) {
    beta[i] = (1 - mu[i]) * phi;
  }
  for ( i in 1:7846 ) {
    alpha[i] = mu[i] * phi;
  }
  sor ~ beta( alpha , beta );
}
generated quantities{
  vector[7846] log_lik;
  vector[7846] alpha;
  vector[7846] beta;
  vector[7846] mu;
  for ( i in 1:7846 ) {
    mu[i] = a_samp[sample_id[i]] + a_basin[basin_id[i]] + bnet[basin_id[i]] * net_dist_s[i] + bstrah[basin_id[i]] * strahler_s[i] + bpp[basin_id[i]] * pp_s[i] + bdrain[basin_id[i]] * da_diff_s[i] + bflow[basin_id[i]] * flow_distB[i];
    mu[i] = inv_logit(mu[i]);
  }
  for ( i in 1:7846 ) {
    beta[i] = (1 - mu[i]) * phi;
  }
  for ( i in 1:7846 ) {
    alpha[i] = mu[i] * phi;
  }
  for ( i in 1:7846 ) log_lik[i] = beta_lpdf( sor[i] | alpha[i] , beta[i] );
}
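I set log_lik = TRUE because I want to compare this model against alternative specifications later on, roughly along these lines (m3_alt is just a placeholder for a hypothetical second model fit to the same data):

library(rethinking)

PSIS(m3)    # PSIS-LOO estimate (also flags observations with high Pareto k)
WAIC(m3)

# comparison against a hypothetical alternative model
compare(m3, m3_alt, func = PSIS)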
I just started working with Bayesian statistics and I don't know anyone who can check my work, so I really appreciate you taking the time to look at this.