# Multivariate bernoulli model

Greetings,

new-ish stan user here and looking to get a hand with my code.

I am attempting to model a multi-variate bernoulli distribution parameterised
the same way as Dai et al. (2013).

P(Y_1=y_1,Y_2=y_2) = p_{00}^{(1-y_1)(1-y_2)}p_{01}^{(1-y_1)y_2}p_{10}^{y_1(1-y_2)}p_11^{y_1y_2}

and looking to infer the parameters p based on observations of Y. Essentially, my approach is to map an observed combination of bernoulli outcomes to an integer. This integer will be the conversion of binary to decimal+1 since indexing is 1 based.

functions {

int f_getIndex(int[] vY){
real index = 1;
int useIndex = 1;
int K = num_elements(vY);
for (i in 1:K){
index = index + vY[i]*pow(2,K-i);
}
while(useIndex < index-0.1){
useIndex = useIndex + 1;
}

return useIndex;
}

real indexMVbern_lpmf(int index, real[] prob_vec){
real out;
out = log(prob_vec[index]);
return out;
}
}
data {
int N; //rows or observations
int K; //cols or number of classifiers
int y[N, K];
}
transformed data {
vector[N] indices;
for (i in 1:N){
indices[i] = f_getIndex(y[i,]);
print(indices[i]);
}

}
parameters {
simplex[K] theta;
vector[K] alpha;
}
model {
for (i in 1:N){
indices[i] ~ indexMVbern_lpmf(theta);
}
theta ~ dirichlet(alpha);
}



My issue is that I am getting the error “No matches for: Available argument signatures for indexMVbern_lpmf: Real return type required for probability function”. I am not sure why I am getting this as I think the variable out being returned by indexMVbern_lpmf is indeed real.

Any help would be much appreciated.

The issue is that indexMVbern_lpmf is expecting the second to argument to be of type real[], but you’re passing a simplex, which is a vector.

You just need to change the declaration to:

real indexMVbern_lpmf(int index, vector prob_vec)


Thank you for your suggestion @andrjohns ! That’s gotten my code to advance, but now I receive the following error.

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
No matches for:

real ~ indexMVbern_lpmf(vector)

Available argument signatures for indexMVbern_lpmf:

int ~ indexMVbern_lpmf(vector)


If I’m understanding this correctly, this is because even though I’m storing integer values into indices it is a vector and cannot be typed to return an int. Would I be able to get around this by changing the suffix of my function to _lpdf and keeping everything the same?

The simplest approach is to just change indices to be an array of integers:

	int indices[N];


Thanks for that! My code is now working :)

One last thing, is there a way that I can force the prior for \theta to be dirichlet(1)?

As in rather than D(\alpha_1,...,\alpha_K), you want the prior to be D(1_1,...,1_K)?

If so, you can just remove alpha as a parameter and specify that the prior is a vector of 1s:

parameters {
simplex[K] theta;
}
model {
for (i in 1:N){
indices[i] ~ indexMVbern_lpmf(theta);
}
theta ~ dirichlet(rep_vector(1,K));
}


Although it would be more efficient to construct the rep_vector(1,K) in the transformed data block and then re-use it.

1 Like

Thank you so much for your continued help!

Now, here is my Rcod used to generate data and run the stan file

library(rstan)
library(dplyr)
library(DirichletReg) ##for rdirichlet

# function to generate mvBern
f_mvBern <- function(K,p){
numClass <- as.integer(log2(length(p)))
output <- matrix(rep(0,numClass*K),nrow = K)
for (i in 1:K){
r <- runif(1)
print(r)
index <- which(cumsum(p)>r)[1]
print(index)
output[i,] <- intToBits(index)[1:numClass] %>% as.integer() %>% rev()
}
print(output)
}

K = 2 #number of classifiers
N = 100 #number of samples

myprob <- rdirichlet(1,rep(1,2^K))
y <- f_mvBern(N,myprob)

myData <- list(N = N, K = K, y = y)
init_fun <- function(){list(theta=rep(1,2^K)/(2^K))}
fit <- stan('mvBern.stan', data = myData, warmup = 10000,iter = 30000,thin=100, init = init_fun)



Along with my modified stan code

functions {

int f_getIndex(int[] vY){
real index = 1;
int useIndex = 1;
int K = num_elements(vY);
for (i in 1:K){
index = index + vY[i]*pow(2,K-i);
}
while(useIndex < index-0.1){
useIndex = useIndex + 1;
}

return useIndex;
}

real indexMVbern_lpmf(int index, vector prob_vec){
real out;
out = log(prob_vec[index]);
return out;
}
}
data {
int N; //rows or observations
int K; //cols or number of classifiers
int y[N, K];
}
transformed data {
//vector[N] indices;
//for (i in 1:N){
//indices[i] = f_getIndex(y[i,]);
//print(indices[i]);
//}
vector[4] alpha; //4 is hard coded for now
int indices[N];
for (i in 1:N){
indices[i] = f_getIndex(y[i,]);
}

alpha = rep_vector(1,4); //4 is hard coded for now
}
parameters {
simplex[4] theta; //4 is hard coded for now

}
model {
for (i in 1:N){
indices[i] ~ indexMVbern_lpmf(theta);
}
theta ~ dirichlet(alpha);
}


The code runs in a reasonable time, but I notice that the estimates for theta appear to be in a different order to that of myprob with the following:

            mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
theta[1]    0.17    0.00 0.04    0.11    0.15    0.17    0.19    0.24   782    1
theta[2]    0.31    0.00 0.04    0.22    0.28    0.31    0.34    0.40   712    1
theta[3]    0.35    0.00 0.05    0.26    0.32    0.35    0.38    0.45   743    1
theta[4]    0.17    0.00 0.04    0.11    0.15    0.17    0.20    0.25   826    1
lp__     -140.60    0.05 1.32 -143.82 -141.20 -140.27 -139.65 -139.16   779    1


and

> myprob
[,1]      [,2]      [,3]      [,4]
[1,] 0.3257096 0.3673334 0.1386896 0.1682674


Why might this be the case? is there some un-identifiability in my model or have I perhaps muddled up the coding?