Feedback on neighborhood model

I am working on a model which involves calculating the neighborhood density of each observation, and then using that neighborhood density (ND) as a predictor term.
The ND of x_i is calculated as follows:

  • find the nearest neighbor of x_i, it is at some distance d
  • count how many other observations are withing d * factor , where factor is some real > 1

To calculate ND I use an NxM matrix, with N being the number of observations I’m modelling, and M the number of potential neighbors.

Now, so far I’ve been setting factor by trial and error, but I thought I’d give it a try building a model to estimate it.
The following is what I came up with:

functions{
  real rank_of(real [] vec, real val) {
    int Ns = size(vec);
    real res = 0.0;
    for (n in 1:Ns) {
      if (vec[n] < val) {
	res += 1;
      } else {
	break;
      }
    }
    return res;
  }
  vector find_neighborhood(matrix NNS, real nn_limit) {
    int Ns = rows(NNS);
    int Ms = cols(NNS);
    vector[Ns] nns;
    for (n in 1:Ns) {
      real row_n[Ms];
      real min_val_fac;
      row_n = to_array_1d(NNS[n]);
      min_val_fac = row_n[1] * nn_limit;
      nns[n] = rank_of(row_n, min_val_fac);
    }
    return(nns);
  }
}
data {
  int<lower=0> N;
  int<lower=0> M;    // matrix size
  matrix[N, M] NNS; // matrix of neighbors, each row is sorted
  int<lower=1> y[N]; // Outcomes
}
parameters {
  real Intercept;
  real beta_nn;   // slope
  real<lower=0,upper=1> nn_limit; // factor for the neighbors, should be within 1 and 2
}
transformed parameters {
  real nn_limit2 = nn_limit+1;
}
model {
  vector[N] nn_dense;
  vector[N] nn_dense_s;
  vector[N] mu;
  nn_dense = find_neighborhood(NNS, nn_limit2);
  nn_dense_s = (nn_dense-mean(nn_dense))/sd(nn_dense);

  mu = rep_vector(Intercept, N) + beta_nn*nn_dense_s;

  Intercept ~ normal(0, 5);
  beta_nn ~ normal(0, 10);
  nn_limit ~ beta(5,5);
  y ~ poisson_log(mu);
}
generated quantities {
  vector[N] nn_dense;
  nn_dense = find_neighborhood(NNS, nn_limit);
}

N is 35 in my test case, and M some 3000.

The model samples well and does not throw any errors. However, by the nature of rank_of(), nn_dense ends up being round numbers, so I don’t know if this is an issue.
Since I haven’t seen models like this around, I’d be interesting in knowing whether someone has already come up with something more sophisticated/better for doing something similar.

If you’re finding the ranks of bunches of numbers in a vector, the best thing to do is sort the vector and use binary search. That’ll be an \mathcal{O}(n \log n) algorithm instead of \mathcal{O}(n^2).

The way you want to do this, it should be written this way:

real rankof(row_vector[] x, real v) {
  for (n in 1:cols(x)) {
    if (x[n] >= v) return n - 1;
  }
  return size(x);
}

But binary search will be faster if . I renamed the args because I don’t think an array argument should be named vec! I also made the type row_vector so you don’t need the to_array_1d.

vector find_neighborhood(matrix NNS, real nn_limit) {
    int N = rows(NNS);
    vector[N] nns;
    for (n in 1:N) {
      nns[n] = rank_of(NNS[n], NNS[n, 1] * nn_limit);
    }
    return nns;
}  

Warning: I didn’t test this code!

Stan does broadcasting, so this

rep_vector(Intercept, N) + beta_nn*nn_dense_s;

can be just

Intercept + beta_nn * nn_dense_s;

We also allow compound declare-define statements, so this

  vector[N] nn_dense;
  nn_dense = find_neighborhood(NNS, nn_limit);

is more compactly written as

vector[N] nn_dense = find_neighborhood(NNS, nn_limit);

I assume that’s what’s intended, since the function returns integers. I was confused why you added to the nn_limit to get nn_limit2 and why the generated quantities don’t match. Because otherwise you could just save the nn_dense value as a transformed parameter rather than calculating twice.

Thanks a lot for the code suggestions Bob!

I assume that’s what’s intended, since the function returns integers.

I was just unsure whether this was against the restriction on sampling discrete parameters in STAN, but I guess it isn’t.