# Feedback on neighborhood model

I am working on a model which involves calculating the neighborhood density of each observation, and then using that neighborhood density (ND) as a predictor term.
The ND of x_i is calculated as follows:

• find the nearest neighbor of x_i, it is at some distance d
• count how many other observations are withing d * factor , where factor is some real > 1

To calculate ND I use an NxM matrix, with N being the number of observations I’m modelling, and M the number of potential neighbors.

Now, so far I’ve been setting factor by trial and error, but I thought I’d give it a try building a model to estimate it.
The following is what I came up with:

functions{
real rank_of(real [] vec, real val) {
int Ns = size(vec);
real res = 0.0;
for (n in 1:Ns) {
if (vec[n] < val) {
res += 1;
} else {
break;
}
}
return res;
}
vector find_neighborhood(matrix NNS, real nn_limit) {
int Ns = rows(NNS);
int Ms = cols(NNS);
vector[Ns] nns;
for (n in 1:Ns) {
real row_n[Ms];
real min_val_fac;
row_n = to_array_1d(NNS[n]);
min_val_fac = row_n[1] * nn_limit;
nns[n] = rank_of(row_n, min_val_fac);
}
return(nns);
}
}
data {
int<lower=0> N;
int<lower=0> M;    // matrix size
matrix[N, M] NNS; // matrix of neighbors, each row is sorted
int<lower=1> y[N]; // Outcomes
}
parameters {
real Intercept;
real beta_nn;   // slope
real<lower=0,upper=1> nn_limit; // factor for the neighbors, should be within 1 and 2
}
transformed parameters {
real nn_limit2 = nn_limit+1;
}
model {
vector[N] nn_dense;
vector[N] nn_dense_s;
vector[N] mu;
nn_dense = find_neighborhood(NNS, nn_limit2);
nn_dense_s = (nn_dense-mean(nn_dense))/sd(nn_dense);

mu = rep_vector(Intercept, N) + beta_nn*nn_dense_s;

Intercept ~ normal(0, 5);
beta_nn ~ normal(0, 10);
nn_limit ~ beta(5,5);
y ~ poisson_log(mu);
}
generated quantities {
vector[N] nn_dense;
nn_dense = find_neighborhood(NNS, nn_limit);
}



N is 35 in my test case, and M some 3000.

The model samples well and does not throw any errors. However, by the nature of rank_of(), nn_dense ends up being round numbers, so I don’t know if this is an issue.
Since I haven’t seen models like this around, I’d be interesting in knowing whether someone has already come up with something more sophisticated/better for doing something similar.

If you’re finding the ranks of bunches of numbers in a vector, the best thing to do is sort the vector and use binary search. That’ll be an \mathcal{O}(n \log n) algorithm instead of \mathcal{O}(n^2).

The way you want to do this, it should be written this way:

real rankof(row_vector[] x, real v) {
for (n in 1:cols(x)) {
if (x[n] >= v) return n - 1;
}
return size(x);
}


But binary search will be faster if . I renamed the args because I don’t think an array argument should be named vec! I also made the type row_vector so you don’t need the to_array_1d.

vector find_neighborhood(matrix NNS, real nn_limit) {
int N = rows(NNS);
vector[N] nns;
for (n in 1:N) {
nns[n] = rank_of(NNS[n], NNS[n, 1] * nn_limit);
}
return nns;
}


Warning: I didn’t test this code!

rep_vector(Intercept, N) + beta_nn*nn_dense_s;


can be just

Intercept + beta_nn * nn_dense_s;


We also allow compound declare-define statements, so this

  vector[N] nn_dense;
nn_dense = find_neighborhood(NNS, nn_limit);


is more compactly written as

vector[N] nn_dense = find_neighborhood(NNS, nn_limit);


I assume that’s what’s intended, since the function returns integers. I was confused why you added to the nn_limit to get nn_limit2 and why the generated quantities don’t match. Because otherwise you could just save the nn_dense value as a transformed parameter rather than calculating twice.

Thanks a lot for the code suggestions Bob!

I assume that’s what’s intended, since the function returns integers.

I was just unsure whether this was against the restriction on sampling discrete parameters in STAN, but I guess it isn’t.