Multinomial distribution for modelling Causal Models with all categorical variables

sam_learner · February 17, 2020, 11:36am

I am beginner using Stan for modelling structural causal models. I would be extremely grateful if someone could please help with the following query. I have looked in the Stan docs as well as the forums, but I wasn’t able to find the answer.

My data contains only categorical variables and I am trying to model the causality between different variables using multinomial distribution. Then send the data from Stan to R to predict rating using all the remaining attributes (topic, location,age,sex) using logistic regression.

The data is as follows :


location     age       sex       topic               rating
denmark    under 35     M        Computers             1      
germany    over 45      F        Cars                  3      
denmark    over 45      M        Electronics           4      
france     under 35     F        Plants                5

I am trying to model the following equations :
Location \sim multinomial(age+sex)
topic \sim multinomial(age+sex)
rating \sim multinomial(age+sex)

The Stan code is as follows :


data{
  int<lower = 0> N; // number of instances in the data
  vector[N] age; // age
  vector[N] sex; //sex
  int location[N]; // location
  int topic[N]; // topic
  int rating[N]; //  rating 
  
  
}
parameters{
  real attitude;
  simplex[N] alpha;
  simplex [N] beta;
  simplex [N] gamma;
  alpha = softmax (age+sex);
  beta = softmax (age+sex) ;
  gamma = softmax (age+sex);
}

model{
 location ~ multinomial( alpha)
  topic ~multinomial(beta)
rating ~multinomial (gamma)
}

The R code is as follows:

raw_data <- read.csv("tp_data.csv")
str(raw_data)
tp <- dplyr::select(raw_data,age,location,sex,rating,topic) 

#Converting Categorical variables to numeric 

tp$location <- as.numeric(as.factor(tp$location))
tp$rating <- as.numeric(as.factor(tp$rating))
tp$topic<- as.numeric(as.factor(tp$topic))
tp$age <- as.numeric(as.factor(tp$age))
tp$sex <- as.numeric(as.factor(tp$sex))


#A series of test/training partition creation and p is percentage of data into training 
trainIndex <- createDataPartition(tp$location, p = .8, 
                                  list = FALSE, 
                                  times = 1)
tpTrain <- tp[trainIndex,]
tpTest  <- tp[-trainIndex,]


n <- nrow(tpTrain)
ne <- nrow(tpTest)



# ------------------------------
tp_stan_train <- list(N = n, age = tpTrain[,c("age")], sex = tpTrain[,c("sex")],
                       location = tpTrain[,c("location")], topic = tpTrain[,c("topic")], rating = tpTrain[,c("rating")])


fit_tp_train <- stan(file = 'tp_train.stan', data = tp_stan_train, iter = 200, chains = 1, verbose = TRUE)

la_tp_train <- extract(fit_tp_train, permuted = TRUE)

## Apply Logistic Regression using model <- glm( rating ~ location+topic+age+sex)

Error I get is the following
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
error in ‘model328474b5861_tp_train3’ at line 18, column 2

16:   simplex [N] beta;
17:   simplex [N] gamma;
18:   alpha = softmax (age+sex);
     ^
19:   beta = softmax (age+sex) ;

PARSER EXPECTED: <one of the following:
  a variable declaration, beginning with type,
      (int, real, vector, row_vector, matrix, unit_vector,
       simplex, ordered, positive_ordered,
       corr_matrix, cov_matrix,
       cholesky_corr, cholesky_cov
  or '}' to close variable declarations>
Error in stanc(filename, allow_undefined = TRUE) : 
  failed to parse Stan model 'tp_train3' due to the above error.

If I assign values to alpha in transformed parameters block I get the error that I can’t assign a value to variable declared in another block. Please help.

I am not sure how to set the value of alpha which should be a simplex as the softmax(age + sex) in the Stan code ?

martinmodrak · February 24, 2020, 9:26am

Sorry for taking long to respond. You need to move both the declaration of alpha and the assignment into the transformed parameters block. Also note that the brms package supports multinomial models via formula syntax using Stan as the backend. It makes many common tasks easier so unless your aim is to learn Stan in order to be able to write more complex models yourself, it might be a better choice.

Best of luck with your model!

sam_learner · February 24, 2020, 9:32am

@martinmodrak Thank you so much. The code works now. I started looking into brms since I was facing errors in Stan and brms definitely makes the things a lot easier

Topic		Replies	Views
Multinomial Model - How to set Prior for categorical variables? Modeling techniques	3	1244	February 18, 2020
Multinomial regression Rstan Modeling	0	423	August 6, 2019
Multinomial logit in rstan Modeling	1	473	August 8, 2019
Help needed in building a causal prediction model using rstan Modeling	0	248	June 4, 2023
How to write a right Multinomial distribution correctly? Modeling	5	909	February 19, 2020

Multinomial distribution for modelling Causal Models with all categorical variables

Error I get is the following SYNTAX ERROR, MESSAGE(S) FROM PARSER: error in ‘model328474b5861_tp_train3’ at line 18, column 2

Related Topics

Error I get is the following
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
error in ‘model328474b5861_tp_train3’ at line 18, column 2