Coding up a new discrete multivariate distribution

Hello, Suppose X=(X_1,X_2,X_3) are dependent r.v.'s such that X_i\in\{1,2,3\}\,\forall i. Note that X can assume \mathcal{X}=\left[\{1,2,3\},\{1,3,2\},\{3,2,1\},\{3,1,2\},\{2,1,3\},\{2,3,1\}\right], Let I_j (j=1,\cdots,6) denote each combination in \mathcal{X}.

The probability mass function is assumes a different value for each combination I_j, that is

P(X_1=x_1,X_2=x_2,X_3=x_3)=\begin{cases} p_1(\mathcal{x}|a,b,c); & \mathcal{x}=I_1 \\ p_2(\mathcal{x}|a,b,c); & \mathcal{x}=I_2 \\ \cdots \\ p_6(\mathcal{x}|a,b,c); & \mathcal{x}=I_6 \end{cases}

a and b are parameters present in all the expressions p_j.

I would like to define this distribution by using the “functions” environment in Stan, but I am sure how to do it efficiently, can we create conditional operations within the functions, so that Stan will consider only those expressions corresponding to the data points?

functions {
    real new_lpmf(vector x, real a, real b) {
    }
}

Thanks.

1 Like

@andrjohns

If I’m reading this correctly, you’ll just want to use if else statements, so something like:

functions {
  real new_lpmf(vector x, real a, real b) {
    if (x[1] == 1 && x[2] == 2 && x[3] == 3) {
      return ...;
    } else if (x[1] == 1 && x[2] == 3 && x[3] == 2) {
      return ...;
    }
  }
}
1 Like

andrjohns, thanks for your reply. That solution seems promising, however, thinking of a larger vector x, that would require a quite long code, since we would have to consider all the combinations. So, in order to avoid the long if (x[1] == 1 && x[2] == 2 && ... && x[r] == r), I am thinking of concatenate the data, so that we have to test only a sequence, that is

if (x == '123') {
return ...;} 
else if (x=='132') { 
return...;} 

Just for the record, I have 64 instances in my problem. I will try it. thanks.

One possibility is to create indicators that map X onto a particular sequence. For example, [1,2,3] maps to 1, [1,3,2] maps to 2, and so on. You could do this outside of Stan and pass it (as below) or you could do it in the transformed data block. You would then only need to calculate the lpmf for each unique sequence rather than each observation, then vectorize the last step with the indicators. So, if there are 10 observations in each of the 64 possible sequences (N = 640), you would only need to go through the loop 64 times rather than 640.

Depending on p(x|a,b,c), you might be able to speed up calculation further without loops. But the example below loops across the number of unique sequences and applies an arbitrary function. I may not have the function declaration 100% correct (I’m a little rusty), but hopefully the idea is clear enough.

functions{
    real new_distribution_lpmf(int[] y, int[] x, int O, real a, real b){
        vector[O] theta;

        // Loop over unique sequences only
        for(o in 1:O){
            theta[o] = some_function(x[o], a, b);
        }

        // Vectorize selection of sequences by observation
        return log(theta[y]);
    }
}

data{
    int N;  // Number of observations
    int M;  // Dimension of X
    int O;  // factorial(M) = number of sequences (or number of realized sequences)

    int[M] X[O]; // Values for each unique sequence
    int<lower = 1, upper = O> Y[N]; // Observation-sequence indicators
}
parameters{
    real alpha;
    real beta;
}
model{
    alpha ~ normal(0, 1);
    beta ~ normal(0, 1);

    Y ~ new_distribution(X, alpha, beta);
}

Note that if you do the data preparation outside Stan (as above), you could always pass O as the number of realized sequences and map Y accordingly. This could probably still be done within the transformed data block, but might be trickier. I know this isn’t exactly what you’re asking about, but could still be helpful.

2 Likes

@Ludwig and @andrjohns: Whenever you see tests against range-limited indexes, it suggests producing an indexing to just look up answers. Along those lines, I’d suggest something like this:

functions {
  real foo_lpmf(array[ , ] int y, log_p) {
     real lp = 0;
     for (y in ys) lp += log_p(y);
     return lp;
  }
data {
  int<lower=0> N;
  array[3,3,3] real<lower = 0, upper = 1> log_p;  // log probabilities
  array[N, 3] int<lower=1, upper=3> y;  // observations
}
model {
  y ~ foo(log_p);
}  

You might need to compute these log probabilities in the transformed data block from your I matrix, and it’s OK to leave some empty if they won’t get used, though I’d suggest testing in transformed data that your data really is limited to the cases you suggest.

Then, if you want to really make this fast, you can do this, where the data block is the same as before (I like to leave the natural interface on the outside).

functions {
  int to_index(int i, int j, int k) {
    return 9 * i + 3 * j + k;
  }
}
transformed data {
  int<lower=0> N;
  array[N, 3] int<lower=1, upper=3> y;
}
transformed data {
  // flatten (log_p, y) to (log_pt, yt) to enable vectorization
  array[27] log_pt;
  for (i in 1:3) {
    for (j in 1:3) {
      for (k in 1:3)  
        log_pt[to_index[i, j, k]] = log_p[i, j, k];

  array[N] int<lower=1, upper=27> yt;
  for (n in 1:N)
    yt[n] = to_index(y[n, 1], y[n, 2], y[n, 3]);
}
model {
  target += log_pt[yt];  // vectorized
}

At some point, it becomes impractical to generate the log_pt and yt if they’re only sparsely filled and the only logic that makes sense is the branching logic.

1 Like

Simon, Yes it is possible to create a 1–1 correspondence to each combination in the dataset. I will try to figure it out. thanks.

Dear @simonbrauer and @Bob_Carpenter, looking at your suggestions, I made the following:

  1. organizing the pmf expression in a data file:
probs=
ys probability
1  log(p1)
2  log(p2)
3  log(p3)
...
6  log(p6)

For instance:
log(p1)=log(exp(-(x-a)^2/2))

ys is auxiliary variable which maps every combination I_j to a element in ys=\{1,2,3,4,5,6\}.

Here comes a Stan language question, can I write expressions in a data.frame and declare them as “data”?

If Stan does not accept that, how can I insert a file with all the instances \log(p_j)?

  1. Suppose that we observed y=\{1,1,2,4,4,4,4,5,5,5\}, so we have to match y with ys, so that only the corresponding expressions will be included in the likelihood. My first thought was to used if's, but that would slow down the simulation.

If I can declare those expressions as data, we could try to merge data with y or, alternatively, we could do it outside Stan and bring only the data.frame with the observed instances. But I am not sure how to do a proper Stan coding, in order to optimize the performance(?).

data_list=list(...,y=y, ys=data[,1], probs=data[,2],...)

functions {
  real new_lpmf(vector int y, log_p) {
    real lp = 0;
     for (y in ys) lp += log_p;
     return lp;
  }

Cheers.

Dear @simonbrauer and @Bob_Carpenter, an errata in my latest post: I am actually “NOT sure how to do a proper Stan coding”.

Since I have quite long expressions for p_1,p_2,..., I think the best option is to write down the likelihood outside Stan and bring it into Stan. That is, for the sample y=\{1,1,2,4,4,4,4,5,5,5\}, the likelihood is

L(a,b,c)=[p_1(y=1|a,b,c)]^2\times[p_2(y=2|a,b,c)]\times[p_4(y=4|a,b,c)]^4\times[p_5(y=5|a,b,c)]^3

So, we need just to declare the log(L), please see if this will work (do I really need this ‘aux’?):

let datalist=list(...,aux=rep(1,N),…)

functions {
real new_lpmf(vector int[] aux, vector real a, vector real b, vector real c){
    log_p = aux*log(L)
    return log_p;
  }
}
model{ 
a ~ normal(0,1)
b ~ normal(0,1)
c ~ normal(0,1)
for(i in 1:N){
    target =  new_pmf(aux[i] | a, b, c);
}
...
}

Cheers.

Hmm, I’m not sure where to go from here. While you wait for others to chime in, could you provide the exact likelihood for the model you’re thinking about? Maybe that will clarify things.

Sorry for the mess. I will try to elaborate better. Just to follow up on what we’ve discussing:

  1. Following your suggestion, I created a variable y which maps all the combinations within \mathcal{X}=\left[\{1,2,3\},\{1,3,2\},\{3,2,1\},\{3,1,2\},\{2,1,3\},\{2,3,1\}\right], so now the data will assume values y=\{1,2,3,4,5,6\}.

  2. The new distribution (in my first post) is now given by

P(X_1=x_1,X_2=x_2,X_3=x_3)=P(Y=y)=\begin{cases} p_1(y|a,b,c); & y=1 \\ p_2(y|a,b,c); & y=2 \\ \cdots \\ p_6(y|a,b,c); & y=6 \end{cases}

This formulation saves us time, since we don’t need to work with the vectors \{1,2,3\}, \{1,3,2\}, etc.

  1. My interest is to estimate a, b, and c, also note that given these parameters, the observations are independent. Thus, for a particular sample, say y=\{1,1,2,4,4,4,4,5,5,5\}, the likelihood is
L(a,b,c)=[p_1(y=1|a,b,c)]^2\times[p_2(y=2|a,b,c)]\times[p_4(y=4|a,b,c)]^4\times[p_5(y=5|a,b,c)]^3
  1. Now comes the Stan part. As far as I can see, I think we have two options here:

i) Tell Stan to get every y in the data, check which part (p_1,p_2,...) the data correspond to, then declare the joint distribution of the data. Perhaps, this is a slower process.

ii) Another option, which I think is more efficient, is to write down the log(Likelihood) outside Stan, just to use \log(L(a,b,c)) (L as above), then declare this log-likelihood in Stan. But I don’t know how to do it.

  1. My latest post was just a try, I created an auxiliary variable aux, just to let Stan ‘think’ this is the data, when in fact the data are already embraced in the expression L(a,b,c). This strategy may look not very smart, due to my lack of fluency in Stan language.

Any ideas would be very useful.

Regards

1 Like

Dear @simonbrauer and @Bob_Carpenter, I have followed some of your ideas and came up with the code below. This is the actual code, not the generic form I was discussing about.

code.txt (123.6 KB)

I get the error messages:

Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.
...
Chain 1: Initialization between (-2, 2) failed after 100 attempts. 
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."   
[2] "In addition: Warning message:"                                            
[3] "In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :"
[4] "  '-E' not found"                                                         
[1] "error occurred during calling the sampler; sampling not done"

I have changed the prior in several ways, and it does not seem to have to do with the prior initial values. Am I doing something wrong? cheers.

It’s very hard to tell what is going on in your code given how expanded it is. I’m unsure what exactly to ask to help clarify, but here is a question, a suggestion, and an observation.

  1. a, b, and \theta seem straight-forward enough looking at the transformed parameters block. But all of the \sigma terms are lost on me. Could you clarify what they mean?
  2. Can you simplify the code such that there are only two possible sequences? That might make it clearer what is going on.
  3. It looks like your data consist of 2 integers from which you are trying to estimate substantially more parameters with vague priors (normal(0,100) on most). That seems unfeasible on the face of it.

Sorry that I cannot be of more help at this point.

1 Like

Hi simon, thanks for your reply. Answering your questions:

The new distribution loglike is so complex that I opted to write it down and then copy-paste into the code, so it brings already all the data. The remaning data are just the dimensions of the matrix ‘p’.

  1. So, all the sigmas are part of the new distribution, together with the matrix ‘p’ (which is a function of a, b and "\theta$), declared in the transformed parameters block.

  2. A simplified version would involve less parameters and the new distribution expression would become simpler, please see the simplified code:

code - simplified.txt (1.6 KB)

  1. It’s because all the data are already in the expression of the loglike.

Cheers.