 # Building a Binomial Model on Sparse Survey Data

Hello all, first post here and relatively new to Stan. I’ve built a few models using this framework now, and this forum has been a fantastic resource, but I haven’t been able to find an analog to what I’m working on now…

I have built the following model, which is designed to track “awareness” from survey data. The input data are the number of surveys completed in each time period (`counts`) and the number of respondents who said they are aware (`awares`). In addition to varying in time, the data are further separated into geographic regions, so each array is sized T x G, where T is the number of time periods and G is the number of geographic regions.

The issue I am trying to solve is that in some instances there are very few surveys and/or very few respondents are “aware” (awareness is generally low in this use case). So I am turning to a hierarchical binomial framework, where the geo-level binomial “success”/aware probability is tied back to a global success probability that is itself time-varying as an AR(1) process. The whole model is supplied below:

``````stan
data {
  int<lower=1> T;  // number of time steps
  int<lower=1> G;  // number of geos

  array[T, G] int<lower=0> counts;  // number of survey responses
  array[T, G] int<lower=0> awares;  // number of aware responses
}

parameters {
  matrix<lower=0, upper=1>[T, G] p_geo;  // probability of awareness in each geo

  // time variation inherited from p_global, with scale sigma
  real<lower=0> sigma;  // sd of p_geo around p_global
  real delta;           // autoregressive coefficient on the previous p_global
  real mu;
}

transformed parameters {
  vector<lower=0, upper=1>[T] p_global;  // time-varying global probability of awareness
  // seed the first time step, then fill in the rest recursively
  p_global[1] = 0.1;

  for (t in 2:T) {
    p_global[t] = mu + p_global[t - 1] * delta;  // AR(1) model
  }
}

model {
  for (t in 1:T) {
    p_geo[t, :] ~ normal(p_global[t], sigma);

    for (g in 1:G) {
      // Equivalent to the binomial pmf:
      //   choose(counts[t,g], awares[t,g]) * p_geo[t,g]^awares[t,g]
      //     * (1 - p_geo[t,g])^(counts[t,g] - awares[t,g])
      // Here we relate the modeled awareness probability to the number of
      // aware individuals we see empirically
      awares[t, g] ~ binomial(counts[t, g], p_geo[t, g]);
    }
  }

  sigma ~ beta(3, 3);
  delta ~ normal(0, 1);
  mu ~ normal(0, 0.5);
}
``````

This specification seems to run fine; however, in my posterior draws I see plenty of distinct samples in `p_geo` but very few in `p_global`. I’m afraid there isn’t a lot of time variation in my raw signals either, so this may be due to autocorrelation in the aggregate data, but I am looking for advice. Mixing for these parameters, and a few others, is also unacceptably poor (sigma: Rhat ≈ 1.5).

Further, I only have about 1500 total time/geo survey data points to feed into the model, and I am afraid this structure might be overparameterized for this data set. I could eliminate the geo level of the model, which would reduce insight but also drastically reduce complexity. Is this advisable?
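For context on the sparsity, this is the kind of summary I mean, sketched on simulated stand-in data (the array names match the Stan data block, but the shapes and rates here are invented, not my real survey):

```python
import numpy as np

rng = np.random.default_rng(0)
T, G = 30, 50  # invented shapes, roughly 1500 time/geo cells

# Stand-ins for the real survey arrays: few surveys per cell, low awareness
counts = rng.poisson(1.0, size=(T, G))
awares = rng.binomial(counts, 0.05)

# Fraction of time/geo cells with no surveys at all, and the pooled rate
empty = (counts == 0).mean()
pooled_rate = awares.sum() / counts.sum()
print(f"empty cells: {empty:.1%}, pooled awareness: {pooled_rate:.3f}")
```

With many empty or near-empty cells, the per-cell binomial likelihoods carry almost no information, which is what makes me suspect the T x G parameterization is too rich.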