Get index for sort and sort matrix by index

Hi everyone,

Having correlated responses in statistical analysis is common. For example, in spatial analysis, we customarily use Gaussian Process to capture the spatial pattern. Some modeling methods approximate the original likelihood by a product of conditionally independent processes, whose performance will depend on the order of the observations. The model I am working on has the same issue. Since it requires sorted locations, I want to write a function to order the coordinates base on one dimension, and sort the outcome and the design matrix according to the sorted locations.

My plan is to get the original index of the sorted locations, and then use the index to sort the response and the predictors. But when I tried to use "sort_indices_asc"in stan math library to sort the input coords:
vector<int> sortind = stan::math::sort_indices_asc(coords.col(0));
(here “coords” is Eigen::Matrix <double, N, 2>) I got an error message:
error: no type named 'type' in 'stan::math::index_type<Eigen::Block<Eigen::Matrix<double, -1, -1, 0, -1, -1>, -1, 1, true> >'.

So here are my questions:
How to use the existing functions to

  1. get the index for sort by the first column of a matrix?
  2. sort the target matrix by the index?

Besides, I wrote two functions for the questions above. It will be great if I can use the functions in Stan Math to replace the function I wrote.

get_sort_ind.hpp

#ifndef GET_SORT_IND_HPP
#define GET_SORT_IND_HPP

#include <fstream>
#include <iostream>
#include <vector>
#include <numeric>

using std::vector;
using std::iota;

template <typename T>
void
get_sort_ind(const T& coords, vector<int>& idx){

    idx.resize(coords.size());
    std::iota(idx.begin(), idx.end(), 0);

    sort(idx.begin(), idx.end(),
         [&coords](size_t i1, size_t i2) {return coords[i1] < coords[i2];});

    return;
}

#endif

sort_data_by_ind.hpp

#ifndef SORT_DATA_BY_IND_HPP
#define SORT_DATA_BY_IND_HPP

#include <fstream>
#include <iostream>
#include <vector>
#include <Eigen/Dense>

using std::vector;
using Eigen::Matrix;
using Eigen::Dynamic;

template <typename T, typename size_t>
Matrix<T, Dynamic, Dynamic>
sort_data_by_ind(const Matrix<T, Dynamic, Dynamic> data,
                 const vector<size_t> sortind){

    Matrix<T, -1, -1> sorted_data;

    sorted_data.resize(data.rows(), data.cols());

    for (int i = 0; i < data.rows(); i++ ){
        for (int j = 0; j < data.cols(); j++){
             sorted_data(i, j) = data(sortind[i], j); 
         }
     }
     return sorted_data;
 }
 #endif

Thanks,

Lu Zhang

If you’re writing a C++ function, you can just call C++
library functions directly to do whatever sorting you want.

But I’m not sure how you expect to sort a Matrix<T, Dynamic, Dynamic>.
That’s a matrix, but sorts assume some kind of linear collection.

You can grab the raw data from the matrix a (which will be stored
in column major order if you go with defaults) with &a(0).

  • Bob

Hi Bob,

Thank you. I want to sort a matrix base on the first column. Let me give an example, if I generate data on n locations {(x_i, y_i) : i = 1, …, n}, I need to reorder the observations base on the value of first dimension {x_i: i = 1, …, n}. Anyway, I’ve already worked out the function. I just want to make my code as simple as possible.

Best,

Lu

Worry more about making the program clear and modular
than about efficiency until you have it all working. Then start
making it faster if it’s not fast enough. If you’re talking
about operations in transformed data, they only get executed once,
so are almost never a bottleneck to processing.

The best way to simplify model code is to pull operations
out into user-defined functions with meaningful names. Then
you can see the function does what its named to do and then
understand it more easily where its used. Modularity and
comprehensibility are two sides of the same coin (the other
sides are testing and doc :-)).

If you have this as data

vector[N] x;
vector[N] y;

and you want to create sorted versions in transformed data,
I’d just do this:

transformed data {
vector[N] x_sorted = x[sort_indices_asc[x]];
vector[N] y_sorted = y[sort_indices_asc[x]];

and not worry about doing the sorting twice it’s only an O(N * log(N))
operation.

To only sort once, you have to do this:

transformed data {
vector[N] x_sorted;
vector[N] y_sorted;
{
int[] idxs = sort_indices_asc(x);
x_sorted = x[idxs];
y_sorted = y[idxs];
}

The idxs is declared as a local variable so it’s not stored as
a member variable.

You could also write a function to do the sorting that would work
for arbitrarily many parallel arrays

vector[] sort_asc_by_first(vector[] xs) {
return xs[ , sort_indices_asc(xs[1])];
}

Then you’d have to do this:

{
vector[N] xy_sorted[2] = sort_asc_by_first({x, y});
x_sorted = xy_sorted[1];
y_sorted = xy_sorted[2];
}

but I think that actually makes the whole thing more opaque, so
I wouldn’t recommend it unless you get more than two containers
to sort.

  • Bob

I see. That’s really helpful! Thank you so much!

Hi Bob,

Sorry for multiple questions, but I tried to parse my code in cmdstan and got the following error message:

Model name=nngp
Input file=./examples/nngp/nngp.stan
Output file=nngp.hpp

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

variable “int” does not exist.

ERROR at line 28

26: {
27: for (i in 1:N) coords1[i] = coords[i, 1];
28: int[] idxs = sort_indices_asc(coords1); // get index base on first coordinate
^
29: L_VB = cholesky_decompose(VB);

Here is my code for the transformed data block. :

transformed data {

cholesky_factor_cov[P] L_VB;
vector[N] coords1;
vector[N] Y_sorted;
matrix[N, P] X_sorted;
matrix[N, 2] coords_sorted;

{
for (i in 1:N) coords1[i] = coords[i, 1];
int[] idxs = sort_indices_asc(coords1);       // get index base on first coords
L_VB = cholesky_decompose(VB);

Y_sorted = Y[idxs];
for (k in 1:N){
    for(p in 1:P){
        X_sorted[k, p] = X[idxs[k], p];
    }
}

for (k in 1:N){
    for(p in 1:2){
        coords_sorted[k, p] = coords[idxs[k], p]; // sort coords by index
    }
}
}

}

Do you happen to know how to deal with it?

Sincerely,

Lu Zhang

Hi Lu,

Declarations of variables MUST be at the top of the block.

The problem is that you’re trying to declare the int variable idxs after the for loop.

The quick solution should be to declare at the top and assign after the loop.

Yes, but that does not work. This is the error report:

Model name=nngp
Input file=./examples/nngp/nngp.stan
Output file=nngp.hpp

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

ERROR at line 25

23: matrix[N, P] X_sorted;
24: matrix[N, 2] coords_sorted;
25: int[N] idxs;
^
26:

PARSER EXPECTED:

This is my code:
transformed data {

cholesky_factor_cov[P] L_VB;
vector[N] coords1;
vector[N] Y_sorted;
matrix[N, P] X_sorted;
matrix[N, 2] coords_sorted;
int[N] idxs;

{
for (i in 1:N) coords1[i] = coords[i, 1];
idxs = sort_indices_asc(coords1);       // get index base on first coords
L_VB = cholesky_decompose(VB);

Y_sorted = Y[idxs];
for (k in 1:N){
    for(p in 1:P){
        X_sorted[k, p] = X[idxs[k], p];
    }
}

for (k in 1:N){
    for(p in 1:2){
        coords_sorted[k, p] = coords[idxs[k], p]; // sort coords by index
    }
}
}

}

I can’t tell what’s going on with the indentation. Can you print out the full error message?

Yes, this is the full error message:

Lus-MacBook-Pro:cmdstan luzhang$ bin/stanc --name=nngp --o=nngp.hpp ./examples/nngp/nngp.stan
Model name=nngp
Input file=./examples/nngp/nngp.stan
Output file=nngp.hpp

SYNTAX ERROR, MESSAGE(S) FROM PARSER:


ERROR at line 25

 23:        matrix[N, P] X_sorted;
 24:        matrix[N, 2] coords_sorted;
 25:        int[N] idxs;
               ^
 26:    

PARSER EXPECTED: <identifier>

Lus-MacBook-Pro:cmdstan luzhang$

Can you either post or send me the entire model. It looks like the end of the message is getting truncated, and I won’t be able to tell what the problem is without the rest of the model.

Yes. I tried to compile my code in cmdstan and see whether it works or not.
This is just a toy example I use for the begining.

If you want help with error messages, you’ll need to send your whole program.

If you quote the output in markdown, it should save the initial whitespace in lines and make it easier to see what’s going on.