Polling model with latent correlated random walk

Dear all,

I’m looking for some modeling advice for the following setup to model daily polls at a national level. I’m first going to describe the setup, then present the model and define the problem, and then ask a couple of specific questions.

The Setup

I have a set of pre-election opinion polls. There are 5 candidates in this election. Any given poll presents a comparison between a subset of candidates (e.g. the 1st candidate v. the 2nd, or the 2nd v. the 3rd), but only a small subset present a full choice-set.

(Toy) Example:
Day 1: Biden = 47%, Trump = 45%, Jill Stein = 3%, Other = 5%;
Day 2: Biden = 50%, Trump = 50%;
Day 3: Biden = 50%, Trump = 45%; Other = 5%;
… etc.

I want to be able to inform an estimate of the underlying vote-choice probabilities for all 5 candidates using all the polls – those using the partial choice-sets as well as the full choice set.

The information from each poll can be broken down into 2 parts:

  1. the support of the candidates relative to each other;
  2. the effect of the specific comparison (i.e. if Jill Stein stays in the set, does this have an independent effect on the choice made by voters).

Let’s ignore (2) ( it is taken care of by choice-specific random effects) , and assume each poll shows us genuine relative support per candidate independent of choice-set, and that respondents are unaffected by the choice-set (I believe this is the `Independence of Irrelevant Alternatives’ assumption).

I want my model to estimate the correlation between any two candidates. When I see the poll on day 2, I want my model to understand that this represents a relative gain of Trump over Biden, and predict the appropriate share for Stein and Other according to the correlations between the candidates.

If Trump and Other vote are negatively correlated, for example, I’d like to see the predicted other-share for that day decrease, even if we did not see a single poll with `Other’ in the choice-set for that day.

The Model

I have tried to model this via a poisson likelihood with a latent multivariate random-walk as follows:


 data{
    int<lower = 1> N; // total n. candidate measurements
    int Y[N]; // daily vote counts
    int n[N]; // daily sample-size
    int choice_id[N]; // candidate id
    int<lower=1> choice_N; // total number of candidates under consideration
    int dte_id[N]; // days-to-election id
    int<lower=1> dte_N; // max number of days-to-election
    int set_id[N]; // candidate-set id
    int<lower=1> set_N; // max number of different candidate-sets
    real<lower = 0> lkj_prior; // prior for the strength of candidate correlatioin
}

parameters {
    matrix[choice_N,set_N] gamma; // candidate choice effect
    vector<lower=0>[set_N] gamma_scale; // candidate choice effect

    array[dte_N] matrix[choice_N,set_N] z; // auxiliary variable for multivariate random walk
    matrix<lower = 0>[choice_N,set_N] delta_scale; // day-to-day change
    array[set_N] cholesky_factor_corr[choice_N] L_Omega; // correlation between candidates
}

transformed parameters {
    vector[N] lambda; // latent support
    array[dte_N] matrix[choice_N,set_N] pi; // probability for each candidate 
    matrix[choice_N,set_N] gamma_star; // choice-level support on election day
    array[dte_N] matrix[choice_N,set_N] delta_star; // daily support per candidate
    array[set_N] matrix[choice_N,choice_N] L; // covariance matrix

    for(s in 1:set_N) {

// non centered param on gamma
    gamma_star[:,s] = gamma[:,s]*gamma_scale[s];

// define covariance matrix 
    L[s] = diag_pre_multiply(delta_scale[1:choice_N,s], L_Omega[s]);
    }

// multivariate random walk on daily support
    for(s in 1:set_N){
      for(c in 1:choice_N) {
        delta_star[1][c,s] = 0; // identify via corner constraint
      }
      for(t in 2:dte_N) {
        delta_star[t][1:choice_N,s] =
        delta_star[t-1][1:choice_N,s] +
        L[s] * z[t][1:choice_N,s];
    } }

// ensure sum to 1
    for(t in 1:dte_N){
      for(s in 1:set_N){
        pi[t][1:choice_N,s] = 
          softmax( 
            gamma_star[1:choice_N,s] +
            delta_star[t][1:choice_N,s] 
          );
    } }
    
// linear predictor
    for (i in 1:N) lambda[i] = n[i]*pi[dte_id[i]][choice_id[i],set_id[i]];
}

model {
    to_vector(gamma) ~ std_normal();
    gamma_scale ~ std_normal();

    for(t in 1:dte_N) to_vector(z[t]) ~ std_normal();
    to_vector(delta_scale) ~ std_normal();
    for(s in 1:set_N) L_Omega[s] ~ lkj_corr_cholesky(lkj_prior);

    Y ~ poisson( lambda ); // likelihood
}

I then calculate the daily probabilities by applying softmax to the estimated daily preferences:

\pi_{tj} = \frac{\mbox{exp}(\gamma^\star_{j, s = s*} + \delta^\star_{t,j, s = s*})}{ \sum_k \mbox{exp}(\gamma^\star_{k, s = s*} + \delta^\star_{t,k, s = s*})} .

Note here that the daily preferences \delta are estimated for each choice-set which is see in the polls, and s = s^* represents the `full’ choice-set, i.e. that which involves all of the candidates.

This model is generally quite good, and manages to recover the correct ‘unobserved’ change in support, or at least to include it in a 95% interval.

The Problem

I ran some tests on a simulated dataset. I fit the above model and another version without the correlation (so a simple random walk).

Unfortunately, it seems the above correlated model provides no improvement in fit (RMSE, Bias, correlation, coverage are all roughly the same for correlated and independent models).

Even when the underlying correlations from the simulated data are high, I don’t see any improvement in accounting for this correlation. I’m a bit puzzled about this (maybe I shouldn’t be ? )

I have observed that whatever the underlying correlation I try to model, this is mostly overtaken by the sum-to-1 effect of softmax, which imposes its own (mostly negative) correlation structure.

Questions

(1) Do you have any idea why this implementation of the model does not provide gains over the independent random walk implementation ? Is there something in the code that I have wrong ?

(2) Is it possible to model a situation in which time-series of probabilities which sum to 1 are assigned a correlation-structure of the sort I’m imagining ? so where a small probability is highly negatively correlated with a large-one ? Looking at the correlation structure implied by a multinomial distribution:

{\displaystyle \rho (X_{i},X_{j})={\frac {\operatorname {Cov} (X_{i},X_{j})}{\sqrt {\operatorname {Var} (X_{i})\operatorname {Var} (X_{j})}}}={\frac {-p_{i}p_{j}}{\sqrt {p_{i}(1-p_{i})p_{j}(1-p_{j})}}}=-{\sqrt {\frac {p_{i}p_{j}}{(1-p_{i})(1-p_{j})}}}.}

, which I understand arises as a result of the sum-to-1 constraint, It seems like changes in a small probability are always bound have minimal, undifferentiated impact on the larger probabilities ( please tell me if I’m wrong) .

Any other suggestions would be highly appreciated.

Many thanks in advance !
Roberto

Adding to the above, here is an example of (no real) differences between the correlated RWs model and the independent RWs model.

The underlying data is a series of correlated normals which sum to zero:

# Define parameters
n_time_points = 30
n_series <- sample(3:5,size = 1) # number of series

# Load necessary library
library(Matrix)

# Create initial correlation matrix
cor_matrix <- 
  matrix(
    runif(n =  (n_series-1)*(n_series-1),min = -1,max = 1),
    nrow = n_series-1, 
    ncol = n_series-1
    )
diag(cor_matrix) <- 1 

# Adjust for positive definiteness
cor_matrix <- nearPD(cor_matrix, corr = TRUE)$mat

# Standard deviations (scale) for each variable
scales <- runif(n = n_series-1,min = 0,max = 0.5)  

# Create a matrix of standard deviation products
sd_product <- outer(scales, scales)

# Convert correlation matrix to covariance matrix
cov_mat <- cor_matrix * sd_product

# first time point - sum to 0 to encourage reasonable behaviour
simulated_data = t(runif(n = n_series-1,min = -1,max = 1))

# subsequent points (random walk)
for(t in 2:n_time_points)  {
  simulated_data <-
    rbind(
      simulated_data,
      simulated_data[t-1,1:n_series-1] +
      mvrnorm(n = 1, mu = rep(0, n_series-1), Sigma = cov_mat)
    )
}

# ensure sum to 0
simulated_data <- cbind(simulated_data,-rowSums(simulated_data))

This is the correlation matrix of the simulated data:

> cor(simulated_data)
           [,1]       [,2]       [,3]
[1,]  1.0000000  0.6300943 -0.6538069
[2,]  0.6300943  1.0000000 -0.9995217
[3,] -0.6538069 -0.9995217  1.0000000

These are transformed into probabilities via softmax:

# Transform to probability scale using softmax
probability_data <- exp(simulated_data)/rowSums(exp(simulated_data))

I then simulate polls via multinomial distribution , and pick a random sample size n per day:

# random sample size per date
n = round(runif(n = n_time_points,min = 0,max = 2500))

Y =
  t(
    sapply(
      1:n_time_points,
      function(x){
        rmultinom(
          n = 1,
          size = n[x],
          prob = t(probability_data[x,])
        ) } ) )

Finally, to exaggerate the problem I’m trying to solve, I make the last 10 observations missing for all but 2 of the candidates:

# make last 15 days disappear for  all but 2 candidates 
Y[1:10,3:n_series] <- NA

For the remaining days I put a bunch of random missingness, to simulate the different choice-sets that pollsters present in their results:

# for each time-point drop one of the series at random
for(t in 1:n_time_points) {
  if(any(is.na(Y[t,]))){next}
  size <- sample( 0:(n_series-2),size = 1)
  if(size==0){next}
    Y[t,sample(1:n_series,size = size)] <- NA
    # ad minimum 2-way comparison
  }

The result looks like this:

      [,1] [,2] [,3]
 [1,]  859  772   NA
 [2,]  375  438   NA
 [3,]  533  827   NA
 [4,]  294  412   NA
 [5,]  247  317   NA
 [6,]  381  751   NA
 [7,]  729 1207   NA
 [8,]  404  620   NA
 [9,]  384  541   NA
[10,]  373  758   NA
[11,]   61   72   NA
[12,]  223  167  241
[13,]  274   NA  421
[14,]  341  215  441
[15,]  502  292   NA
[16,]  525  372   NA
[17,]   NA  337  303
[18,]  189  133  280
[19,]   23   17   30
[20,]    1    2    0
[21,]  639  394 1318
[22,]  560   NA 1043
[23,]   NA   77  600
[24,]  229   92   NA
[25,]  293   91  982
[26,]  385  104 1287
[27,]   NA   33  937
[28,]  172   27   NA
[29,]  183   25 1547
[30,]  138   12 1049

Here I plot them:
simulated_plots.pdf (10.6 KB)

I then setup the data in long format and feed it to Stan:


# derive choice_set
set = as.factor(apply(Y,1,function(x){paste0(which(!is.na(x)),collapse = '-')}))
set_id = as.integer(set)

sims_dt <-
  data.table(
    Y = as.numeric(Y),
    choice_id = rep(1:n_series,each = n_time_points),
    dte_id = rep(1:n_time_points,n_series),
    n = rep(n,n_series),
    set_id = rep(set_id,n_series)
    )

sims_dt <- sims_dt [complete.cases(sims_dt )]
sims_dt[,n:=sum(Y),by = c('dte_id')]

ljk_prior = 1

data_list =
  list(
    Y = sims_dt$Y,
    n = sims_dt $n,
    choice_id = sims_dt $choice_id,
    choice_N = n_series,
    dte_id = sims_dt $dte_id,
    dte_N = n_time_points,
    N = dim(sims_dt)[1],
    set_id = sims_dt$set_id,
    set_N = max(sims_dt$set_id),
    lkj_prior = ljk_prior
  )

I run the models on this dataset and get the following results:

> print(results_list)
         model avg_cor_size n_time_points n_series n_per_day mean_abs_bias       rmse       cor
1:  correlated     0.761141            30        3    1239.5    0.04164795 0.09029833 0.8868615
2: independent     0.761141            30        3    1239.5    0.04331779 0.09080962 0.8856843
       cover
1: 0.8666667
2: 0.8555556

Below are the plots of the estimates:
Correlated Estimates -

Independent Estimates -

In short - I’m not clear on how it is possible that the two models produce pretty much the same estimates, even when one accounts for correlation at the other doesn’t . Maybe there are things I’m overlooking… any help would be welcome.

Thanks for the detailed post. It makes it a lot easier to answer.

If you take a simplex parameter (non-negative values that sum to 1), then they are negatively correlated by construction—if one goes up, something else has to go down to maintain the sum to 1 property. If you use something like a Dirichlet(alpha) distribution, you can’t model correlation, so you have to move to something like a multivariate logistic normal: Logit-normal distribution - Wikipedia. It’s basically doing what you are suggesting, which is sonftmaxing a normal. You can use it on any kind of basis.

If you have a model that generates probabilities for all five candidates at every time point, you can just restrict to the ones that are in the poll. For example, if we have a probability simplex (c1, c2, …, c5) over the five candidates (non-negative, sums to 1), then if we take two of the candidates, say 2 and 4, we can just renormalize to get the simplex (c2, c4) / (c2 + c4). The likelihood here is multinomial, not Poisson (though there is a duality after you fix the total count, but I’d just go with multinomial because it’s the natural choice here). You can tell from how you simulated your data—I’d just have the Stan model match the generative process you’re using.

Without looking into the details, my offhand guess would be that you don’t have enough information to tightly estimate the correlations. Look at the posteriors on your correlation structure to see if it’s doing anything.

1 Like

Thanks for taking the time Bob, it’s very helpful. Given your response, I have some residual questions.

Re - the logistic-normal distribution:

If you take a simplex parameter (non-negative values that sum to 1), then they are negatively correlated by construction—if one goes up, something else has to go down to maintain the sum to 1 property. If you use something like a Dirichlet(alpha) distribution, you can’t model correlation, so you have to move to something like a multivariate logistic normal: Logit-normal distribution - Wikipedia. It’s basically doing what you are suggesting, which is sonftmaxing a normal. You can use it on any kind of basis.

I want to check that I understand the mechanism through which this allows for relaxing (to some degree) the negativity in correlations introduced by the sum-to-1 constraint.

  1. if the correlation structure remains that of the multinomial likelihood (which it should if my likelihood is multinomial/poisson and my link is softmax), the only way in which the underlying latent model can affect the correlation is by affecting the probability values \pi_1,...,\pi_J, given there is no separate correlation parameter.

  2. Hence, to the degree that the multivariate normal distribution on the latent variable \mu_i,...,\mu_J affects the correlation between counts, this will be via:

        a) its effect on the expected value of the latent variable;

        b) its effect on the underlying latent variance, where higher variance will push the probabilities to 1/J.

But here I have 2 competing intuitions:

  • On one level, from the definition of a multivariate normal, the expected value of is not affected by the presence of a correlation structure – so an independent set of random walks will have the same expected value as a correlated set of random walk, namely \mu_{t-1,j}. Hence I expect the correlation at the latent level to affect the correlation between counts only via its effect on the variance of \mu.

  • On the other hand, when we have multivariate normal in a different context – here I’m thinking of the conditionally autoregressive prior (CAR), we observe `shrinkage’ of a given estimated \mu towards the neighboring \mu. So here we observe a correlation structure which affects the expected values, and where its effect is proportional to the implied strength and direction of the correlation.

I guess the latter can be reconciled with the former if shrinkage in CAR is exclusively due to the shared variance constraint, in which case this shrinkage between highly correlated candidates can be implemented also in the logistic-normal via sharing the variance across candidates. Though I don’t have an intuition as to when/if this would be desirable.

Is the above logic correct ?

Relatedly, If there would be gains from using the logistic-normal with a correlation structure (v. the logistic-normal with independent random walks), would you expect them to appear in the RMSE or the Coverage ? I don’t have intuitions here because large variance on the latent level doesn’t translate linearly to large variance on the probability scale, and similarly it could affect RMSE if the variance explodes, given the pull towards 1/J - so i don’t expect uniformly better coverage.

Re - informing a correlation structure:

Without looking into the details, my offhand guess would be that you don’t have enough information to tightly estimate the correlations. Look at the posteriors on your correlation structure to see if it’s doing anything.

I agree with this – suppose I want to specify some strong, relatively tight and informative priors about a specific correlation structure. This information could come from say historical polling data. Moving this around could also help me test what sorts of correlation structures would be compatible with the daily swings in votes that I have in mind.

How would I pass these priors to the multivariate normal ? I can’t use the ljk prior I imagine, but on the other hand the basic multivariate normal is probably hugely inefficient in terms of sampling. Perhaps there is another way ?

Re - multinomial v. poisson likelihood:

If you have a model that generates probabilities for all five candidates at every time point, you can just restrict to the ones that are in the poll. For example, if we have a probability simplex (c1, c2, …, c5) over the five candidates (non-negative, sums to 1), then if we take two of the candidates, say 2 and 4, we can just renormalize to get the simplex (c2, c4) / (c2 + c4). The likelihood here is multinomial, not Poisson (though there is a duality after you fix the total count, but I’d just go with multinomial because it’s the natural choice here). You can tell from how you simulated your data—I’d just have the Stan model match the generative process you’re using.

My mind also went to the multinomial first, but I encountered problems in how to pass `partial’ multinomials to Stan — i.e. I have some polls with two candidates, some with three and some with 4, etc. How does my dependent variable look like ? Is it a list of matrices, which I pass to stan as:

array[set_N] int Y[dte_N,choice_N];

– but then Is there a way to have the multinomial distribution take a varying number of choices ?

I’d like to avoid having to have a separate dependent variable for each candidate set. In that case I have to specify a separate likelihood + link function for each element in the list, and I further have to keep track of the various indices for covariates.

This seemed to me like an unnecessary addition in complexity + code lines – given, as you say, the multinomial and poisson likelihood are the same for fixed n.

Is it still your intuition that this might be something worth trying ? If you think there might be gains there, I’d be keen to try it.

Many thanks again for taking the time.

I’m leaving this here for posterity in case someone finds it useful.

I ran a simulation study with S = 100 iterations. I tested 4 models, varying in correlation structures and link functions, against each other:
1 - independent - SoftMax;
2 - correlated - SoftMax;
3 - independent - Isometric Log-Ratio;
4 - correlated - Isometric Log-Ratio;

The Isometric Log-Ratio (ILR) is a link function used in compositional data. I wasn’t very familiar with it ( see Egozcue[1] , Filzmozer et al.[2] ) but I found it appealing for this problem because it has the property of generating representations of compositional variables on the Euclidean space which preserve the relationships (ratios) implied by the original scale. Unlike other log-ratio transformations, it is not equivalent to SoftMax, hence specifying correlations on the latent scale has the potential of learning something different from SoftMax.

In setting this up I simplified the data generating process to what amount to a simple imputation problem: a have a number n_series of multinomial counts, representing the results from opinion polls from different candidates, each series is n_time_points long, and for all series I have missing values at random days. For all but 2 of the series, the last \frac{1}{3} of the observations is missing. The challenge for the models is to reconstruct these values, along with the other randomly missing ones. The full code used for the DGP and the models can be found at the bottom. I will focus here on the results.

I find evidence that the correlated models outperform the independent ones, but in my view this is surprisingly thin… in principle I would have expected better performance on every simulation round, but this is not the case. ILR did not give major differences in performance, so I will focus on correlated v. independent SoftMax below (though ILR did produce substantially different predictions compared to SoftMax).

The results of the simulation are attached in this csv file:
simulation_results_stan.forum.csv (463.4 KB) . I aggregate by iteration to simply look at the average difference in performance between models in each simulation, rather than individual predictions within simulations, and focus on the average ability to predict the series with the `missing’ \frac{1}{3} of observations.

In ~ 70\% of the simulations, the RMSE is greater for the independent model than the correlated model. Similar results appear for absolute bias, correlation and coverage. The differences appear to be small on average, but become large in instances where the independent model `fails’. So one could be tempted to conclude the correlated model is more robust under a correlated DGP.

The average % gain in the predicted values is close to a percentage point (per candidate), which is small, though not totally negligible.

If I try and look at what characteristics of the simulation are associated with the difference in performance, I don’t get much information (100 simulations is too little to say anything definitive for subsets of observations, but the computational costs are elevated when the number of candidates considered is > 3, and / or the time points > 30, so I have to come up with a more efficient strategy before I simulate more). Below the results of a regression tree with default settings:


Clearly the magnitude of the correlation in the underlying data – as we would hope – is important when it comes to explaining the difference in performance between independent and correlated models. The sample size per day also plays a role, as do the total number of time points. I don’t put too much stock on any specific split considered by the tree, as the final nodes have very few observations, but the general idea that correlation, sample size and length of the series matter can be safely extrapolated, I think.

Finally, on the point that Bob made re - `the model does not perform well because it has not enough information properly learn the covariance’, I checked this for a couple of the simulations. The uncertainty around the correlation parameters is typically contained – the problem seems to be rather that the point estimates of these correlations are off, sometimes with an opposite sign, compared to the DGP.
I suppose this could be an issue still related to the lack of power – if I see samples for 30 time points and extrapolate a correlation from these, especially when some variable with large correlations but highly negative latent values will never be seen due to 0 sampling probability with small n, I’m likely to get my correlation structure wrong.

So what conclusions can I draw ?

I think it makes sense to fit the correlated model, though it does come at a heavy computational cost. I will expect better performance when:
(1) the underlying correlations are strong ;
(2) my sampling procedure is powerful enough to sample also candidates with extremely low prevalence ;
(3) I can observe a large number of time points.

Further, I can safely say the use of ILR does not guarantee better performance.

As always any thoughts are welcome !


Code for the simulations study

for(iter in 1:n_iter){

# Define parameters
n_time_points = sample(30:365,size = 1)

n_series <- sample(3:10,size = 1) # number of series

# Create initial correlation matrix
cor_matrix <- 
  matrix(
    runif(n =  (n_series)*(n_series),min = -1,max = 1),
    nrow = n_series, 
    ncol = n_series
  )
diag(cor_matrix) <- 1 

# with even moderately large n_series, you're likely to observe a reasonable 
# amount of correlation randomly. 
# we expect the biggest gains to be when average correlation size is very high, 
# and when the number of series is relatively small

# Adjust for positive definiteness
cor_matrix <- nearPD(cor_matrix, corr = TRUE)$mat

# Standard deviations (scale) for each variable
scales <- runif(n = n_series,min = 0,max = 0.25)  

# Create a matrix of standard deviation products
sd_product <- outer(scales, scales)

# Convert correlation matrix to covariance matrix
cov_mat <- cor_matrix * sd_product

# first time point - sum to 0 to encourage reasonable behaviour
x0 = runif(n = n_series,min = -1,max = 1)
simulated_data <- t(x0)#-mean(x0))

# subsequent points (random walk)
for(t in 2:n_time_points)  {
  simulated_data <-
    rbind(
      simulated_data,
      simulated_data[t-1,1:n_series] +
        mvrnorm(n = 1, mu = rep(0, n_series), Sigma = cov_mat)
    )
}

## ensure sum to 0
#simulated_data <- cbind(simulated_data,-rowSums(simulated_data))

# store average correlation
avg_cor_size <- mean(abs(cor(simulated_data)[lower.tri(cor(simulated_data))]))

# Transform to probability scale (0 to 1) using softmax
probability_data <- exp(simulated_data)/rowSums(exp(simulated_data))

prevalence = apply(probability_data,2,mean)
sd_pi = apply(probability_data,2,sd)

# random sample size per date
n = round(runif(n = n_time_points,min = 0,max = 2500))
n = ifelse(n==0,1,n)

Y =
  t(
    sapply(
      1:n_time_points,
      function(x){
        rmultinom(
          n = 1,
          size = n[x],
          prob = t(probability_data[x,])
        ) } ) )

# make last 15 days disappear for one of the series 
empty = 3:n_series # sort(sample(3:n_series,size = sample(1:(n_series-2),size = 1)))
Y[1:round((0.33*n_time_points)),empty] <- NA

# # # to evaluate reconstruction power, and overall estimation accuracy
# for each time-point drop one of the series at random

for(t in 1:n_time_points) {
  if(any(is.na(Y[t,]))){next}
  size <- sample( 0:(n_series-2),size = 1)
  if(size==0){next}
  Y[t,sample(1:n_series,size = size)] <- NA
  # ad minimum 2-way comparison
}
print(Y)


# Check the result
par(mfrow = c(2,2))

plot(
  simulated_data[,1],
  type = 'l',
  ylim = c(min(simulated_data),max(simulated_data)),
  col = 1,
  xlim = c(n_time_points,1) ,
  ylab = "Latent Support",
  xlab = "Time")
for(i in 2:n_series) {
  lines(simulated_data[,i], type = 'l', col = i)
}

plot(
  probability_data[,1],
  type = 'l',
  col = 1,
  ylim = c(0,1),
  xlim = c(n_time_points,1) ,
  ylab = "Probability",
  xlab = "Time")
for(i in 2:n_series) {
  lines(probability_data[,i], type = 'l', col = i)
}

plot(
  Y[,1],
  type = 'l',
  col = 1,
  ylim = c(0,max(Y,na.rm=TRUE)),
  xlim = c(n_time_points,1) ,
  ylab = "Observed Counts",
  xlab = "Time")
# highlight daily sampling effort with size of circles
points(Y[,1], col = 1,cex = 2*n/max(n))
for(i in 2:n_series) {
  points(Y[,i], col = i,cex = 2*n/max(n))
  lines(Y[,i], type = 'l', col = i)
}


sims_dt <-
  data.table(
    Y = as.numeric(Y),
    choice_id = rep(1:n_series,each = n_time_points),
    dte_id = rep(1:n_time_points,n_series),
    n = rep(n,n_series)
    )
sims_dt <- sims_dt[complete.cases(sims_dt )]

ljk_prior = 1

data_list =
  list(
    Y = sims_dt$Y,
    n = sims_dt $n,
    choice_id = sims_dt $choice_id,
    choice_N = n_series,
    dte_id = sims_dt $dte_id,
    dte_N = n_time_points,
    N = dim(sims_dt)[1],
    lkj_prior = ljk_prior,
    e = ilrBase(D = n_series,method = 'basic' )
  )

for(link in c('SoftMax','ILR')){
  for(type in c('correlated','independent')){
  
if(link == 'SoftMax'){
  link_it <- 'pi[t] = exp(mu[t])/sum(exp(mu[t]));'
}
if(link == 'ILR'){
  link_it <- "pi[t] = exp(e*mu[t][1:(choice_N-1)])/sum(exp(e*mu[t][1:(choice_N-1)]));"
}
  
if(type == 'correlated'){
model = paste0(
  "data{
    int<lower = 1> N; // total n. candidate measurements
    int Y[N]; // poll-level vote counts
    int n[N]; // poll-level sample-size
    int choice_id[N]; // candidate id
    int<lower=1> choice_N; // total number of candidates under consideration
    int dte_id[N]; // days-to-election id
    int<lower=1> dte_N; // max number of days-to-election
    real<lower = 0> lkj_prior; // prior for the strength of candidate correlatioin
    matrix[choice_N,choice_N-1] e; // ILR Basis
}

parameters {
    vector[choice_N-1] gamma; // candidate choice effect
    vector<lower=0>[choice_N-1] gamma_scale; // candidate choice effect

    array[dte_N] vector[choice_N-1] z; // auxiliary variable for multivariate random walk
    vector<lower = 0>[choice_N-1] delta_scale; // day-to-day change
    cholesky_factor_corr[choice_N-1] L_Omega; // correlation between candidates
}

transformed parameters {
    vector[N] lambda; // latent support

    array[dte_N] vector[choice_N] mu;
    array[dte_N] vector[choice_N] pi;
    
    vector[choice_N-1] gamma_star = gamma.*gamma_scale; // choice-level log-support on election day
    array[dte_N] vector[choice_N-1] delta_star; // daily log-support per candidate
    matrix[choice_N-1,choice_N-1] L; // covariance matrix

// define covariance matrix on daily changes in support
      L = diag_pre_multiply(delta_scale[1:(choice_N-1)], L_Omega);

// random walk on daily support
      for(c in 1:(choice_N-1)) {
        delta_star[1][c] = 0; // identify via corner constraint
      }
      for(t in 2:dte_N) {
        delta_star[t][1:(choice_N-1)] =
        delta_star[t-1][1:(choice_N-1)] +
        L * z[t][1:(choice_N-1)];
      }

    for(t in 1:dte_N){
        mu[t][1:(choice_N-1)] = 
            gamma_star[1:(choice_N-1)] +
            delta_star[t][1:(choice_N-1)];
      mu[t][choice_N] = 0;
      }
    
      for(t in 1:dte_N){
            ",link_it,"
    }

// linear predictor
    for (i in 1:N) lambda[i] = n[i]*pi[dte_id[i]][choice_id[i]];
}

model {
    to_vector(gamma) ~ std_normal();
    to_vector(gamma_scale) ~ std_normal();

    for(t in 1:dte_N) to_vector(z[t]) ~ std_normal();
    to_vector(delta_scale) ~ std_normal();
    L_Omega ~ lkj_corr_cholesky(lkj_prior);
    
    Y ~ poisson( lambda ); // likelihood
}
")

# parameters to monitor
pars <-
  c('gamma_star', # choice baseline
    'delta_star', # campaign dynamic effect
    'delta_scale',
    'pi',
    'L',
    'L_Omega'
  )

}
if(type == 'independent'){
model = 
paste0(
  "data{
    int<lower = 1> N; // total n. candidate measurements
    int Y[N]; // poll-level vote counts
    int n[N]; // poll-level sample-size
    int choice_id[N]; // candidate id
    int<lower=1> choice_N; // total number of candidates under consideration
    int dte_id[N]; // days-to-election id
    int<lower=1> dte_N; // max number of days-to-election
    real<lower = 0> lkj_prior; // prior for the strength of candidate correlatioin
    matrix[choice_N,choice_N-1] e; // ILR Basis
}

parameters {
    vector[choice_N-1] gamma; // candidate choice effect
    vector<lower=0>[choice_N-1] gamma_scale; // candidate choice effect

    array[dte_N] vector[choice_N-1] z; // auxiliary variable for multivariate random walk
    vector<lower = 0>[choice_N-1] delta_scale; // day-to-day change
}

transformed parameters {
    vector[N] lambda; // latent support

    array[dte_N] vector[choice_N] mu;
    array[dte_N] vector[choice_N] pi;
    
    vector[choice_N-1] gamma_star = gamma.*gamma_scale; // choice-level log-support on election day
    array[dte_N] vector[choice_N-1] delta_star; // daily log-support per candidate

// random walk on daily support
      for(c in 1:(choice_N-1)) {
        delta_star[1][c] = 0; // identify via corner constraint
      }
      for(t in 2:dte_N) {
        delta_star[t][1:(choice_N-1)] =
        delta_star[t-1][1:(choice_N-1)] +
        delta_scale[1:(choice_N-1)] .* z[t][1:(choice_N-1)];
    } 

    for(t in 1:dte_N){
        mu[t][1:(choice_N-1)] = 
            gamma_star[1:(choice_N-1)] +
            delta_star[t][1:(choice_N-1)];
        mu[t][choice_N] = 0;
      }
    
      for(t in 1:dte_N){
            ",link_it,"
    } 
// linear predictor
    for (i in 1:N) lambda[i] = n[i]*pi[dte_id[i]][choice_id[i]];
}

model {
    to_vector(gamma) ~ std_normal();
    to_vector(gamma_scale) ~ std_normal();

    for(t in 1:dte_N) to_vector(z[t]) ~ std_normal();
    to_vector(delta_scale) ~ std_normal();

    Y ~ poisson( lambda ); // likelihood
}
")

# parameters to monitor
pars <-
  c('gamma_star', # choice baseline
    'delta_star', # campaign dynamic effect
    'delta_scale',
    'pi'
  )

}

# fit model
fit_object <-
  stan(model_code = model,
       pars = pars,
       data = data_list,
       iter = 1000,
       warmup = 750,
       refresh = 1,
       thin = 4,
       cores = 8,
       chains = 8,
       control = list(max_treedepth = 20,
                      adapt_delta = 0.8),
       verbose = TRUE
  )

# check convergence
summary.fit <- summary(fit_object)$summary

pars_hat <- rstan::extract(fit_object, pars = pars, inc_warmup = FALSE)

S = dim(pars_hat$delta_star)[1]

# calculate candidate posterior predictive support
pi <-
  lapply(1:S,
         function(s){
           sapply(1:n_series,
                  function(c){
                    pars_hat$pi[s,,c]
                  } )
         } )

# calculate posterior predicitve probabilities
preds =
  data.table(
    choice_id = rep(1:n_series,each = n_time_points),
    dte_id = rep(1:n_time_points, n_series),
    obs = as.numeric(probability_data),
    missing = as.vector(is.na(Y)),
    pred_lo = as.numeric(sapply(1:n_series,function(c){
      apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.05)
    })),
    pred_mid = as.numeric(sapply(1:n_series,function(c){
      apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.5)
    })),
    pred_hi = as.numeric(sapply(1:n_series,function(c){
      apply(sapply(1:S,function(s){pi[[s]][,c]}),1,quantile,0.95)
    }))
  )

par(mfrow = c(3,2))
for(c in 1:n_series){
  plot(
    preds[choice_id == c]$pred_mid,
    col = NA,
    ylim = c(0,1),
    xlim = c(n_time_points,1)
  )
  lines(y = preds[choice_id == c]$pred_mid,x = 1:n_time_points,col = 'dodgerblue',lwd = 2)
  polygon(c(1:n_time_points, rev(1:n_time_points)), c(preds[choice_id == c]$pred_lo, rev(preds[choice_id == c]$pred_hi)), col = adjustcolor('skyblue',0.5), border = NA)
  lines(y = probability_data[,c],x = 1:n_time_points,col = 'black')
}

results <-
  data.table(
    iter = iter,
    model = type,
    link = link,
    avg_cor_size  = avg_cor_size,
    n_time_points = n_time_points,
    n_series = n_series,
    prevalence = prevalence, 
    sd_pi = sd_pi,
    n_per_day = sum(n)/n_time_points,
    bias = sapply(1:n_series,function(x){bias(y = preds[choice_id == x & missing]$obs,yhat = preds [choice_id == x & missing]$pred_mid)}),
    rmse = sapply(1:n_series,function(x){rmse(y = preds[which(choice_id == x & missing)]$obs,yhat = preds[which(choice_id == x & missing)] $pred_mid)}),
    cor = sapply(1:n_series,function(x){cor(x = preds [which(choice_id == x &missing)]$pred_mid,y = preds[which(choice_id == x &missing)]$obs)}),
    cover = sapply(1:n_series,function(x){cover(y = preds[which(choice_id == x &missing)]$obs,yhat_lo = preds[which(choice_id == x &missing)]$pred_lo,yhat_hi = preds[which(choice_id == x &missing)]$pred_hi)})
  )
results$empty = 'no'
results$empty[empty] = 'yes'
  
results_list <- rbindlist(list(results_list,results) )

print(results_list)

# save(results_list,file = 'generated_data/offset.model_simulations.results.RData',compress = TRUE)

gc()
} }

}

  1. Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barcelo-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical geology , 35 (3), 279-300. ↩︎

  2. Filzmoser, P., Hron, K., & Reimann, C. (2010). The bivariate statistical analysis of environmental (compositional) data. Science of the Total Environment , 408 (19), 4230-4238. ↩︎