Ragged arrays and time series operations


#1

Hi,

I’m working on an election state-space model similar to the Peter Ellis example posted recently on Andrew’s blog, but expanded to include many elections instead of just one (and a different country). I’m running into some difficulty using ragged arrays, both in the different number of polls from each pollster and the different number of days between elections.

If I’m thinking about estimating a single party’s vote share in each of the 10 elections below (where n_days is the number of days since the last election), then I’d have to estimate a parameter mu of length 9442.
image

I’m trying to use the segment slicing suggested in chp 16 of the manual, but confused about how to incorporate a time series element. For example, if I wanted to mu parameter to be modeled as a random walk from the previous day in the same election, I imagined I would need something like below. This runs into two problems: first, I’m not allowed to declare an int in the transformed parameters block; second, stan doesn’t seem to like my attempt at indexing the result of the segment() function (segment(mu, pos, n_days[k])[1]). I’m trying to assign the first value of each election season to the previous result (mu_start), then a random walk after that. Is there a better way to proceed?

transformed parameters{
  vector[total_days] mu;
  int pos;
  pos = 1;
  for (k in 1:N_election) {
    segment(mu, pos, n_days[k])[1] = mu_start[k,2];
    for (t in 2:n_days[k]){
      segment(mu, pos, n_days[k])[t] = segment(mu, pos, n_days[k])[t-1]+segment(innovation, pos, n_days[k])[t];
    }
        pos = pos + n_days[k];
  }
}

model{
  innovation ~ student_t(4, 0, 1);
  
  int pos2;
  pos2 = 1;
  for (k in 1:N_poll_elec_prov_id) {
    segment(y, pos2, s[k]) ~ normal(mu[segment(day_index, pos2, s[k])], y_se*sd_inflator);
  pos2 = pos2 + s[k];
}
}

#2

The data and parameters block in case that is helpful:

data{
  
 int N; //number of unique polls
 int N_election; // number of elections max(election_order)
 int N_poll_elec_prov_id; //number of unique poll-election-region pairs
 int s[N_poll_elec_prov_id]; //lengths of ragged pollster/elec/region arrays
 int total_days; //necessary length of mu (in future maybe can combine same election, for now keeping separate by region)
 int n_days[N_election]; //number of days in each election
 int day_index[N]; //index for matching polls to mu


 matrix[N_election,7] mu_start; //starting values from previous election

 real sd_inflator; //inflating the sd from each poll by same factor

 vector[N] y; //vector of polls 
 int y_se;
}

parameters{
  //real d[N_pollster]; //house effects
  vector[total_days] innovation;
}

#3

You can utilize an int that is declared at the top of a local block inside the transformed parameters block, as in

transformed parameters {
  vector[total_days] mu;
  {
    int pos = 1;
    for (k in 1:N_election) {
       // define mu
    }
  }
}

#4

Ah, nice! Thanks!


#5

Any thoughts on the time series definition of ragged arrays? I’m still struggling with that one.


#6

You can’t assign to segment (or any other function). You would have to use the colon slice operator.


#7

Ok, I’ve re-written without the segment function in the transformed parameters block and it seems ok syntactically, but I’m getting the following error:
"Rejecting initial value:
Error evaluating the log probability at the initial value.
validate transformed params: mu[3] is nan, but must be greater than or equal to 0"
What might be causing it to throw this error?

I’m also getting this warning:
“WARNING: left-hand side variable (name=mu) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.”

transformed parameters{
  vector<lower=0,upper=1>[total_days] mu;
  {
    int pos = 1;
    for (k in 1:N_election) {
      mu[pos] = mu_start[k,2];
      mu[(pos+1):n_days[k]] = mu[ pos: (n_days[k]-1)]+innovation[(pos+1):n_days[k]];
      pos = pos + n_days[k];
  }
  }
} 
model{
  innovation ~ student_t(4, 0, 1);
  
  {
  int pos2= 1;
  for (k in 1:N_poll_elec_prov_id) {
    segment(y, pos2, s[k]) ~ normal(mu[segment(day_index, pos2, s[k])], y_se*sd_inflator);
  pos2 = pos2 + s[k];
}
}
}

#8

I think I got it figured out, realized that I didn’t need the segment statement in the transformed parameters block - could transform with a loop instead. I think the vectorization that was important was in the model block instead, which also needed editing so that the day_index would pick the right days out of mu. Thanks for your help!

transformed parameters{
  vector[total_days] mu;
  {
    int pos = 1;
    for (k in 1:N_election) {
      mu[pos] = mu_start[k,2];
      for (t in (pos+1):(pos+n_days[k]-1)){
        mu[t] = mu[t-1] + innovation[t];
    }
      pos = pos + n_days[k];
  }
  }
}

model{
  innovation ~ student_t(4, 0, 1);
  
  {
  int pos2= 1;
  for (k in 1:N_poll_elec_prov_id) {
    segment(y, pos2, s[k]) ~ normal(mu[day_index[pos2:s[k]]], segment(y_se, pos2, s[k]));
    pos2 = pos2 + s[k];
}
}
}

#9

That’s the one that’s important for speed. Loops are fast in and of themselves—they just compile to C++. When we loop over distributions, we lose opportunities to share repeated calculations (like log(sigma) in a log normal density) and have to reduce to a less efficient form of autodiff expression graph where we wind up having to essentially do more interpreted (dynamically resolved, rather than compiled and statically resolved) operations at runtime.


#10

Thanks Bob and Ben. I’ve got the model working on a limited set of data, but it’s slow and when I expand to my full data set it takes a very long time and usually runs into problems (e.g. Error in unserialize(socklist[[n]]) : error reading from connection). Because of the ragged array set up the mu matrix I’m estimating is about 37000x4. I was getting the inefficient deep copy warning because I had mu[t,]= mu[t-1,] and now have a loop specifying the columns which helped immensely on a small dataset, but there must be more things I’m doing inefficiently in the stan model. Does anything I’m doing look especially egregious or non-best practice? One thought I had was that I think I’m specifying the prior for the final day of each election twice - once through the innovation and once through mu_finish.

data{
  
//read in dimensions
 int N; //number of unique polls
 int p; //number of parties 
 int N_election; // number of elections max(election_order)
 int N_poll_elec_prov_id; //number of unique poll-election-region pairs
 int s[N_poll_elec_prov_id]; //lengths of ragged pollster/elec/region arrays
 int total_days; //necessary length of mu 
 int n_days[N_election]; //number of days in each election
 int day_index[N]; //index for matching polls to mu

 int N_pollster; //number of pollsters for house effects
 int pollster_id[N_poll_elec_prov_id];
 int last_day_index[N_election-1]; 
 matrix[N_election,p] mu_start; //starting values from previous election
 matrix [N_election-1,p] mu_finish; //finishing values from election results

// actual values in polls, in 0-1 scale
  matrix[N,p] y; //matrix of polls - with p as number of parties
  matrix[N,p] y_se; //matrix of se - with p as number of parties
}

parameters{
  matrix[N_pollster,p] house_effect; //house effects
  matrix[total_days,p] innovation;
  real<lower=0> sigma;
  real<lower=0> sd_inflator[N_pollster]; //inflating the sd from each poll by same factor
  //real<lower=0> sd_inflator; //inflating the sd from each poll by same factor

}

transformed parameters{
  matrix[total_days,p] mu;
  {
    int pos = 1;
    for (k in 1:N_election) {
      mu[pos,] = mu_start[k,];
      for (t in (pos+1):(pos+n_days[k]-1)){
        for (i in 1:p){
        mu[t,i] = mu[t-1,i] + innovation[t,i]*sigma;
}
    }
      pos = pos + n_days[k];
  }
  }
}

model{
  for (i in 1:p){
    innovation[,i] ~ student_t(4, 0, 1);
     mu_finish[,i] ~ normal(mu[last_day_index,i], 0.0001);
    house_effect[,i] ~ normal(0,.05);
  }
  sigma ~ normal(0.001,0.001);
  sd_inflator ~ normal(1,5);
  


  {
  int pos2= 1;
  for (k in 1:N_poll_elec_prov_id) {
    for (i in 1:p){
    segment(y[,i], pos2, s[k]) ~ normal(mu[segment(day_index, pos2, s[k]),i]+house_effect[pollster_id[k],i], sd_inflator[pollster_id[k]]*segment(y_se[,i], pos2, s[k]));
    }
    pos2 = pos2 + s[k];
}
}
}

#11

You want to multiply matrix innovation by scalar sigma and then assign by slicing.

  matrix[total_days, p] innov_sigma = innovation * sigma;
  ...
  mu[t] = mu[t - 1] + innovation_sigma[t];

Now it’d be much more efficient if mu and innovation_sigma are arrays of vectors—it’d remove all the copying.

The priors can be vectorized as

to_vector(innovation) ~ student_t(4, 0, 1);
to_vector(house_effect) ~ normal(0, 0.05);

Now it can be inefficient if those house effects are that tiny because adaptaiton has work work it out. So insteand, you can do

to_vector(house_effect_std) ~ normal(0, 1);
house_effect = 20 * house_effect_std;

That takes the pressure off adaptation by starting it in a good place.

normals with 0.001 scales are really going to be bad news. Same deal with all those, which means rescaling mu and sigma.

I find the array slicing notation easier than slicing, so

y[pos2:pos2 + s[k] - 1, i] ~ ...

Not more efficient, but pos2 = pos2 + s[k] can now be pos2 += s[k].

This sd_inflator[pollster_id[k]]*segment(y_se[,i], pos2, s[k]) can probably be made more efficient by doing the multiplication outside in some controlled way rather than one element at a time. But I’m out of steam on this one.


#12

Thanks so much Bob!


#13

I’ve made changes you’ve suggested but I’m struggling a bit with the arrays of vectors notation a bit. I think I’ve got mu and innov_sigma assigned correctly as arrays of vectors, but in my model block I’m getting the following error. Does this mean I’m trying to add an array and a row vector and stan doesn’t like that?

No matches for: 

  row vector[] + row vector


expression is ill formed
  error in 'model_ondp_polling_aggregation_v5' at line 70, column 104
  -------------------------------------------------
    68:   int pos2= 1;
    69:   for (k in 1:N_poll_elec_prov_id) {
    70:     y[pos2:(pos2 + s[k]-1),] ~ normal(mu[day_index[pos2:(pos2 + s[k]-1)]]+house_effect[pollster_id[k],], sd_inflator[pollster_id[k],]*y_se[pos2:(pos2 + s[k]-1),]);
                                                                                                               ^
    71:     pos2 += s[k];

Here is the model code:

transformed parameters{
  row_vector[p] mu[total_days];
  row_vector[p] innov_sigma[total_days];
  matrix[N_pollster,p] house_effect = house_effect_std/20;
  real<lower=0> sigma = sigma_std/1000;
  for (i in 1:total_days){
    innov_sigma[i]= sigma * innovation[i,];
  }
  {
    int pos = 1;
    for (k in 1:N_election) {
      mu[pos] = mu_start[k,];
      for (t in (pos+1):(pos+n_days[k]-1)){
        mu[t] = mu[t-1] + innov_sigma[t];

    }
      if (k<N_election){
        mu[n_days[k]] = mu_finish[k,];
      }
      pos = pos + n_days[k];
  }
  }
}

model{
  to_vector(innovation) ~ student_t(4, 0, 1);
  to_vector(house_effect_std) ~ normal(0,1);
  sigma_std ~ normal(1,1);
  sd_inflator ~ normal(1,5);

  {
  int pos2= 1;
  for (k in 1:N_poll_elec_prov_id) {
    y[pos2:(pos2 + s[k]-1),] ~ normal(mu[day_index[pos2:(pos2 + s[k]-1)]]+house_effect[pollster_id[k],], sd_inflator[pollster_id[k],]*y_se[pos2:(pos2 + s[k]-1),]);
    pos2 += s[k];
}
}
}

#14

It’s trying to add an array of row vectors (row vector[]) to a row vector (row vector). There’s not a function to do that. But then there’s not a function to add a matrix to a row vector either, so I’m not quite sure what you’re trying to do. Are you adding the second row vector to each element of the first row vector?


#15

In the previous block I had matrix ~ normal(,) which I think is wrong, here I’ve got it back to looping through the columns (1:p) but still getting a similar error (now it’s real[] + real instead of row vectors). I wrote a toy example of what I think is happening below the model, I don’t understand how to access the array of reals correctly.

  {
  int pos2= 1;
  for (k in 1:N_poll_elec_prov_id) {
    for (i in 1:p){
    y[pos2:(pos2 + s[k]-1),i] ~ normal(mu[day_index[pos2:(pos2 + s[k]-1)],i]+house_effect[pollster_id[k],i], sd_inflator[pollster_id[k],i]*y_se[pos2:(pos2 + s[k]-1),i]);
    }
    pos2 += s[k];
}
}


mu = [[a]
      [e]]

he = [1]
y[1
  2] ~ normal([[a+1]
               [e+1]], sd)

#16

Single indexes reduce dimensionality; multiple indexes maintains dimensionality. So the result of y[multi-index, single-index] is a one dimensional array. In your case, that array is defined by

y[pos2:(pos2 + s[k]-1),i] [n] =def= y[pos2 + n - 1, i]

The problem ou have is that you’re trying add real arrays and real, which won’t work. For matrix arithmetic like addition, you need vector data types. The problem comes when you try to add mu[day_index[multi-index]], whihc produces an array, with house_effect[single-index, single-index], which produces a scalar.