Multinomial distribution for modelling Causal Models with all categorical variables

I am beginner using Stan for modelling structural causal models. I would be extremely grateful if someone could please help with the following query. I have looked in the Stan docs as well as the forums, but I wasn’t able to find the answer.

My data contains only categorical variables and I am trying to model the causality between different variables using multinomial distribution. Then send the data from Stan to R to predict rating using all the remaining attributes (topic, location,age,sex) using logistic regression.

The data is as follows :


location     age       sex       topic               rating
denmark    under 35     M        Computers             1      
germany    over 45      F        Cars                  3      
denmark    over 45      M        Electronics           4      
france     under 35     F        Plants                5     

I am trying to model the following equations :
Location \sim multinomial(age+sex)
topic \sim multinomial(age+sex)
rating \sim multinomial(age+sex)

The Stan code is as follows :


data{
  int<lower = 0> N; // number of instances in the data
  vector[N] age; // age
  vector[N] sex; //sex
  int location[N]; // location
  int topic[N]; // topic
  int rating[N]; //  rating 
  
  
}
parameters{
  real attitude;
  simplex[N] alpha;
  simplex [N] beta;
  simplex [N] gamma;
  alpha = softmax (age+sex);
  beta = softmax (age+sex) ;
  gamma = softmax (age+sex);
}

model{
 location ~ multinomial( alpha)
  topic ~multinomial(beta)
rating ~multinomial (gamma)
}

The R code is as follows:

raw_data <- read.csv("tp_data.csv")
str(raw_data)
tp <- dplyr::select(raw_data,age,location,sex,rating,topic) 

#Converting Categorical variables to numeric 

tp$location <- as.numeric(as.factor(tp$location))
tp$rating <- as.numeric(as.factor(tp$rating))
tp$topic<- as.numeric(as.factor(tp$topic))
tp$age <- as.numeric(as.factor(tp$age))
tp$sex <- as.numeric(as.factor(tp$sex))


#A series of test/training partition creation and p is percentage of data into training 
trainIndex <- createDataPartition(tp$location, p = .8, 
                                  list = FALSE, 
                                  times = 1)
tpTrain <- tp[trainIndex,]
tpTest  <- tp[-trainIndex,]


n <- nrow(tpTrain)
ne <- nrow(tpTest)



# ------------------------------
tp_stan_train <- list(N = n, age = tpTrain[,c("age")], sex = tpTrain[,c("sex")],
                       location = tpTrain[,c("location")], topic = tpTrain[,c("topic")], rating = tpTrain[,c("rating")])


fit_tp_train <- stan(file = 'tp_train.stan', data = tp_stan_train, iter = 200, chains = 1, verbose = TRUE)

la_tp_train <- extract(fit_tp_train, permuted = TRUE)

## Apply Logistic Regression using model <- glm( rating ~ location+topic+age+sex)

Error I get is the following
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
error in ‘model328474b5861_tp_train3’ at line 18, column 2

16:   simplex [N] beta;
17:   simplex [N] gamma;
18:   alpha = softmax (age+sex);
     ^
19:   beta = softmax (age+sex) ;

PARSER EXPECTED: <one of the following:
  a variable declaration, beginning with type,
      (int, real, vector, row_vector, matrix, unit_vector,
       simplex, ordered, positive_ordered,
       corr_matrix, cov_matrix,
       cholesky_corr, cholesky_cov
  or '}' to close variable declarations>
Error in stanc(filename, allow_undefined = TRUE) : 
  failed to parse Stan model 'tp_train3' due to the above error.

If I assign values to alpha in transformed parameters block I get the error that I can’t assign a value to variable declared in another block. Please help.

I am not sure how to set the value of alpha which should be a simplex as the softmax(age + sex) in the Stan code ?

1 Like

Sorry for taking long to respond. You need to move both the declaration of alpha and the assignment into the transformed parameters block. Also note that the brms package supports multinomial models via formula syntax using Stan as the backend. It makes many common tasks easier so unless your aim is to learn Stan in order to be able to write more complex models yourself, it might be a better choice.

Best of luck with your model!

1 Like

@martinmodrak Thank you so much. The code works now. I started looking into brms since I was facing errors in Stan and brms definitely makes the things a lot easier