Fitting a distribution across multiple sites without using a for loop

There is probably a really easy solution to this, but I’m coming up short. I have code where I fit a Pareto distribution to data, and I’m looking to find out what the alpha parameter is for the distribution:

data {
	int<lower=0> N;
	real<lower=0> x_min;
	vector<lower=0>[N] x;
}

parameters {
	// Pareto density
	real<lower=0, upper=5> alpha;
}

model {
	// Prior: Pareto density
	alpha ~ lognormal(1, 1) T[0, 5];
	// x_min uniform on its interval.
	
	// Likelihood: Pareto density
	x ~ pareto(x_min, alpha);
}

This works well for a single measured site, where

   
N <- length(data$x)
x <- data$x
x_min <- x_min

stan_dat <- list(N = N, x = x, x_min = x_min)

However, I have more than one site (so, data$site_label exists). I could just fit a distribution to each site in a for loop in R and get the outputs individually, but is there a way to do all that in just one model? How would I code that?

Whether you implement it with matrix operations or a for loop, as long as each term increments the log posterior density, it will all be fit as one model. If you want the data from each site to inform the parameters for the other sites, you could use a hierarchical model by choosing some hyperpriors for the alpha distribution, for example:

data{

 int<lower=0> N;
 int<lower=0> nSites;
 vector<lower=0>[nSites] x_min;
 array[nSites] vector<lower=0>[N] x;

}

parameters {

 real alphaMu;
 real<lower=0> alphaSigma;

 vector<lower=0, upper=5>[nSites] alpha;

}

model{

 alphaMu ~ normal(0,1);
 alphaSigma ~ lognormal(0,1);

 // truncation is only defined for scalar variates, so this cannot be vectorized
 for (site in 1:nSites) {
  alpha[site] ~ lognormal(alphaMu, alphaSigma) T[0, 5];
 }

 for (site in 1:nSites) {
  target += pareto_lpdf(x[site] | x_min[site], alpha[site]);
 }

}

However, I do suspect you’ll run into some computational issues from taking a lognormal distribution and truncating it to (0, 5), regardless of the mu and sigma the model finds for the alpha distribution.
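One way to sidestep that truncation (a sketch, not from the original post; alphaRaw is a name I’ve introduced) is to drop the (0, 5) bounds entirely and write the hierarchy noncentered on the log scale, which keeps alpha positive without a truncated prior:

parameters {

 real alphaMu;
 real<lower=0> alphaSigma;

 // noncentered: standard-normal raw parameters
 vector[nSites] alphaRaw;

}

transformed parameters {

 // implies alpha ~ lognormal(alphaMu, alphaSigma)
 vector<lower=0>[nSites] alpha = exp(alphaMu + alphaSigma * alphaRaw);

}

model {

 alphaMu ~ normal(0, 1);
 alphaSigma ~ lognormal(0, 1);
 alphaRaw ~ std_normal();

 for (site in 1:nSites) {
  target += pareto_lpdf(x[site] | x_min[site], alpha[site]);
 }

}

Note this sketch has no upper bound of 5 on alpha; if you truly need one, the truncated version above is the direct translation.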

Without hierarchical modeling:

data{

 int<lower=0> N;
 int<lower=0> nSites;
 vector<lower=0>[nSites] x_min;
 array[nSites] vector<lower=0>[N] x;

}

parameters {

vector<lower=0, upper=5>[nSites] alpha;

}

model{

 // truncation requires a scalar variate, so loop over sites
 for (site in 1:nSites) {
  alpha[site] ~ lognormal(1, 1) T[0, 5];
 }

 for (site in 1:nSites) {
  target += pareto_lpdf(x[site] | x_min[site], alpha[site]);
 }

}

Ah, thank you! That second one without hierarchical modeling is definitely the one I was looking for. I was using a for loop external to the modeling process (meaning, rerunning the model independently for each site I wanted to look at), so this version is perfect.

I’m not familiar with making an array to prep for that code, is this how you do it?

# Prepare data for Stan
stan_data <- list(
  N = nrow(df),                   # Number of observations
  nSites = length(unique(df$site)),  # Number of sites
  x_min = df$x_min,               # Minimum values of 'x' for each site
  x = split(df$x, df$site)        # Split 'x' by site into a list
)

# Convert 'x' to array
stan_data$x <- lapply(stan_data$x, as.array)

# Convert list to array
stan_data$x <- array(unlist(stan_data$x), dim = c(length(stan_data$x[[1]]), length(stan_data$x)))

You would use a struct variable where x is a cell array with one cell per site, then save the struct as a .json file to use as the input file for Stan.

If this were MATLAB, you would have:

Var.N
Var.nSites
Var.x_min
Var.x{site}

jsonwrite(Var, 'filename'), where jsonwrite is your .json writing function and 'filename' is the name of the output file. This is my handwritten MATLAB function for that:

function [] = jsonwrite(var,filename)

%used to write a structured variable to a json file

%turn the data matrix into its json equivalent

jsf = jsonencode(var);

%prepare the file name

filesave = strcat(filename,'.json');

%begin a file with writing permission

fid = fopen(filesave,'w');

%write to text file

fprintf(fid,'%s',jsf);

%close the file

fclose(fid);

end

One thing that occurs to me: I don’t have the same number of data points x for each site. Wouldn’t an array need to have the same number of data points for each site?

You would have to pad your data: Stan does not directly support ragged data structures.
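A common pattern for this (a sketch; maxN and n are names I’m introducing, not from the thread) is to pad each site’s x out to the length of the longest site, pass the true per-site counts as data, and only evaluate the leading entries of each row with head():

data {

 int<lower=0> maxN;                      // length of the longest site
 int<lower=0> nSites;
 array[nSites] int<lower=0> n;           // actual observation count per site
 vector<lower=0>[nSites] x_min;
 array[nSites] vector<lower=0>[maxN] x;  // entries past n[site] are padding (any value >= 0)

}

parameters {

 vector<lower=0, upper=5>[nSites] alpha;

}

model {

 for (site in 1:nSites) {
  alpha[site] ~ lognormal(1, 1) T[0, 5];
  target += pareto_lpdf(head(x[site], n[site]) | x_min[site], alpha[site]);
 }

}

head(x[site], n[site]) returns the first n[site] elements of the row, so the padded tail never enters the likelihood.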