Returning quantiles (or weighted quantiles) in generated quantities block

Hi, I’m running a complex Stan model via RStan. It estimates population densities over space and time. I want to calculate some summary statistics of these populations (like center of spatial distribution, edge of spatial distribution, etc) within Stan as generated quantities, so they have their own posterior distributions, rather than calculating them post hoc in R.

Some of these are calculated as weighted quantiles. For example, the center of spatial distribution is the median of observations across space, weighted by the size (abundance) of each observation. The edge is usually calculated a 0.05 or 0.95 quantile of the same.

I know a quantile function has recently been implemented in Stan, but I’m having trouble getting it to work—the function reference link is broken and I haven’t totally parsed the long GH issues discussing its development.

I tried writing a quick model to explore how the quantile function works, but can’t even get that to work. So, this is a two-part question:

  1. HOW DOES QUANTILE() WORK IN STAN?

My toy model:

data {
real probs;
int np;
int ny;
}

parameters {
  matrix<lower=0>[np, ny] N;  
}

model{
  for(p in 1:np){
    for(y in 1:ny){
      N[p,y] ~ normal(10000, 3000);
    }
  }
}

generated quantities {
    real quant_out[ny]; 
    for(y in 1:ny){
quant_out = quantile(N[,y], probs); 
    }
}

Returns this error that seems to imply that quantile() doesn’t return anything:

A returning function was expected but an undeclared identifier 'quantile' was supplied.

And for extra credit…
2. HAS ANYONE WRITTEN A USER-DEFINED FUNCTION FOR CALCULATING WEIGHTED QUANTILES IN STAN?

This function would take as arguments a real probability / quantile, a vector or 1D array of data, and a vector or 1D array of weights (same dimensions as the data). I understand there are multiple ways to calculate weighted quantiles, and I’m not too picky about the exact method, since I’ll explore sensitivity to it regardless.

here’s the link to the function reference for the standard normal cdf Phi() if that’s what you wanted:

We also have _cdf and _lcdf for cdfs and log cdfs.

I was about to say there wasn’t a quantile() function in Stan, but apparently someone added one; the doc is here:

These only work for data, not for parameters, so your program isn’t going to compile. You need at least Stan 2.27, which means you are going to have to use cmdstan, cmdstanr, or cmdstanpy (or maybe PyStan, but rstan is out of date on CRAN and is only up to 2.26 on GitHub (this is due to CRAN’s restrictive policies on package size and dependencies and the Stan project not organizing itself around CRAN releases).

And we also don’t have inverse CDF functions implemented other than for the normal distribution. But given that N[p, y] is a truncated normal(1e4, 3e3), you can compute the actual quantile for the density, not the empirical one from your sample.

Also, if you can get parameters closer to unit scale, Stan will work a lot better. We don’t have a way to supply both a lower bound and an offset and multiplier. So you’re probably better off implementing this as:

parameters {
  matrix<lower=0>[np, ny] N_std;
}
transformed parameters {
  matrix<lower=0>[np, ny] N = 3e3 + 1e4 * N_std;
}
model {
  to_vector(N_std) ~ std_normal();
}

Stan samples the parameters, so this puts the sampler back on the centered unit scale.

I haven’t seen one, but then I don’t know what a weighted quantile is. If you can write down the definition, it shouldn’t be too hard to define. But the problem is that it’s probably not going to do what you want, which is why we only have quantiles implemented for data. The problem with selecting and branching is that we can’t autodiff through those choices, so you wind up disconnecting the densities.

Thanks so much for all of this valuable explanation, I really appreciate it. I got approximately the function I wanted since posting this (apologies that there are some use-case specific things in here about indexing over patches and years, as I’m modeling changing species distributions). The note that quantile() only works for data explains a lot! Your reply gives me yet another reason to transition this implementation from rstan to cmdstanr.


  // cumulative sum function
  // slightly different than the built-in Stan one -- this takes a 1D array of numbers and returns a single real number that sums elements 1:x of that array
  real csum(real[] to_sum, int x){
    return(sum(to_sum[1:x])); 
  }
  
  // function to calculate range quantiles
  
  real calculate_range_quantile(int np, vector patches, real[] dens_by_patch, real quantile_out){
    vector[np] csum_dens;
    vector[np] quant_p; 
    real diff_p[np]; 
    real quant_position; 
    real dens_p[np];
    
    for(i in 1:np){
      csum_dens[i] = csum(dens_by_patch[1:np], i); // calculate cumulative sum of density along each patch 
    }
    
    quant_p = csum_dens / sum(dens_by_patch[1:np]); // turn in to proportions, i.e., quantiles 
    
    for(i in 1:np){
      diff_p[i] = fabs(quant_p[i] - quantile_out); // minimizes absolute distance between estimated and desired quantile 
    }
    
    // DOESN'T DEAL WITH TIES! (but they seem unlikely with only 10 patches)
    
    // probably a more elegant way to do this... 
    for(i in 1:np){
      if(diff_p[i] == min(diff_p)) {
        quant_position = patches[i];
      }
    }
    return(quant_position); 
    
  } // close quantile function