Multivariate bernoulli model

Greetings,

new-ish stan user here and looking to get a hand with my code.

I am attempting to model a multi-variate bernoulli distribution parameterised
the same way as Dai et al. (2013).

P(Y_1=y_1,Y_2=y_2) = p_{00}^{(1-y_1)(1-y_2)}p_{01}^{(1-y_1)y_2}p_{10}^{y_1(1-y_2)}p_11^{y_1y_2}

and looking to infer the parameters p based on observations of Y. Essentially, my approach is to map an observed combination of bernoulli outcomes to an integer. This integer will be the conversion of binary to decimal+1 since indexing is 1 based.

functions {

	int f_getIndex(int[] vY){
		real index = 1;
		int useIndex = 1;
		int K = num_elements(vY);
		for (i in 1:K){
			index = index + vY[i]*pow(2,K-i);
		}
		while(useIndex < index-0.1){
		  useIndex = useIndex + 1;
		}
		
		return useIndex;
	}

	real indexMVbern_lpmf(int index, real[] prob_vec){
		real out;
		out = log(prob_vec[index]);
		return out;
	}
}
data {
	int N; //rows or observations
	int K; //cols or number of classifiers
	int y[N, K];
}
transformed data {
	vector[N] indices;
	for (i in 1:N){
		indices[i] = f_getIndex(y[i,]);
		print(indices[i]);
	}
	
}
parameters {
	simplex[K] theta;
	vector[K] alpha;
}
model {
    for (i in 1:N){
      indices[i] ~ indexMVbern_lpmf(theta);
    }
    theta ~ dirichlet(alpha);
}

My issue is that I am getting the error “No matches for: Available argument signatures for indexMVbern_lpmf: Real return type required for probability function”. I am not sure why I am getting this as I think the variable out being returned by indexMVbern_lpmf is indeed real.

Any help would be much appreciated.

The issue is that indexMVbern_lpmf is expecting the second to argument to be of type real[], but you’re passing a simplex, which is a vector.

You just need to change the declaration to:

real indexMVbern_lpmf(int index, vector prob_vec)

Thank you for your suggestion @andrjohns ! That’s gotten my code to advance, but now I receive the following error.

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for: 

  real ~ indexMVbern_lpmf(vector)

Available argument signatures for indexMVbern_lpmf:

  int ~ indexMVbern_lpmf(vector)

If I’m understanding this correctly, this is because even though I’m storing integer values into indices it is a vector and cannot be typed to return an int. Would I be able to get around this by changing the suffix of my function to _lpdf and keeping everything the same?

The simplest approach is to just change indices to be an array of integers:

	int indices[N];

Thanks for that! My code is now working :)

One last thing, is there a way that I can force the prior for \theta to be dirichlet(1)?

As in rather than D(\alpha_1,...,\alpha_K), you want the prior to be D(1_1,...,1_K)?

If so, you can just remove alpha as a parameter and specify that the prior is a vector of 1s:

parameters {
	simplex[K] theta;
}
model {
    for (i in 1:N){
      indices[i] ~ indexMVbern_lpmf(theta);
    }
    theta ~ dirichlet(rep_vector(1,K));
}

Although it would be more efficient to construct the rep_vector(1,K) in the transformed data block and then re-use it.

1 Like

Thank you so much for your continued help!

Now, here is my Rcod used to generate data and run the stan file

library(rstan)
library(dplyr)
library(DirichletReg) ##for rdirichlet

# function to generate mvBern
f_mvBern <- function(K,p){
  numClass <- as.integer(log2(length(p)))
  output <- matrix(rep(0,numClass*K),nrow = K)
  for (i in 1:K){
    r <- runif(1)
    print(r)
    index <- which(cumsum(p)>r)[1]
    print(index)
    output[i,] <- intToBits(index)[1:numClass] %>% as.integer() %>% rev()
  }
  print(output)
}

K = 2 #number of classifiers
N = 100 #number of samples

myprob <- rdirichlet(1,rep(1,2^K))
y <- f_mvBern(N,myprob)

myData <- list(N = N, K = K, y = y)
init_fun <- function(){list(theta=rep(1,2^K)/(2^K))}
fit <- stan('mvBern.stan', data = myData, warmup = 10000,iter = 30000,thin=100, init = init_fun)

Along with my modified stan code

functions {

	int f_getIndex(int[] vY){
		real index = 1;
		int useIndex = 1;
		int K = num_elements(vY);
		for (i in 1:K){
			index = index + vY[i]*pow(2,K-i);
		}
		while(useIndex < index-0.1){
		  useIndex = useIndex + 1;
		}
		
		return useIndex;
	}

	real indexMVbern_lpmf(int index, vector prob_vec){
		real out;
		out = log(prob_vec[index]);
		return out;
	}
}
data {
	int N; //rows or observations
	int K; //cols or number of classifiers
	int y[N, K];
}
transformed data {
	//vector[N] indices;
	//for (i in 1:N){
		//indices[i] = f_getIndex(y[i,]);
		//print(indices[i]);
	//}
	vector[4] alpha; //4 is hard coded for now
	int indices[N];
	for (i in 1:N){
		indices[i] = f_getIndex(y[i,]);
	}
	
	alpha = rep_vector(1,4); //4 is hard coded for now
}
parameters {
	simplex[4] theta; //4 is hard coded for now

}
model {
    for (i in 1:N){
      indices[i] ~ indexMVbern_lpmf(theta);
    }
    theta ~ dirichlet(alpha);
}

The code runs in a reasonable time, but I notice that the estimates for theta appear to be in a different order to that of myprob with the following:

            mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta[1]    0.17    0.00 0.04    0.11    0.15    0.17    0.19    0.24   782    1
theta[2]    0.31    0.00 0.04    0.22    0.28    0.31    0.34    0.40   712    1
theta[3]    0.35    0.00 0.05    0.26    0.32    0.35    0.38    0.45   743    1
theta[4]    0.17    0.00 0.04    0.11    0.15    0.17    0.20    0.25   826    1
lp__     -140.60    0.05 1.32 -143.82 -141.20 -140.27 -139.65 -139.16   779    1

and

> myprob
          [,1]      [,2]      [,3]      [,4]
[1,] 0.3257096 0.3673334 0.1386896 0.1682674

Why might this be the case? is there some un-identifiability in my model or have I perhaps muddled up the coding?