I am beginner using Stan for modelling structural causal models. I would be extremely grateful if someone could please help with the following query. I have looked in the Stan docs as well as the forums, but I wasn’t able to find the answer.
My data contains only categorical variables and I am trying to model the causality between different variables using multinomial distribution. Then send the data from Stan to R to predict rating using all the remaining attributes (topic, location,age,sex) using logistic regression.
The data is as follows :
location age sex topic rating
denmark under 35 M Computers 1
germany over 45 F Cars 3
denmark over 45 M Electronics 4
france under 35 F Plants 5
I am trying to model the following equations :
Location \sim multinomial(age+sex)
topic \sim multinomial(age+sex)
rating \sim multinomial(age+sex)
The Stan code is as follows :
data{
int<lower = 0> N; // number of instances in the data
vector[N] age; // age
vector[N] sex; //sex
int location[N]; // location
int topic[N]; // topic
int rating[N]; // rating
}
parameters{
real attitude;
simplex[N] alpha;
simplex [N] beta;
simplex [N] gamma;
alpha = softmax (age+sex);
beta = softmax (age+sex) ;
gamma = softmax (age+sex);
}
model{
location ~ multinomial( alpha)
topic ~multinomial(beta)
rating ~multinomial (gamma)
}
The R code is as follows:
raw_data <- read.csv("tp_data.csv")
str(raw_data)
tp <- dplyr::select(raw_data,age,location,sex,rating,topic)
#Converting Categorical variables to numeric
tp$location <- as.numeric(as.factor(tp$location))
tp$rating <- as.numeric(as.factor(tp$rating))
tp$topic<- as.numeric(as.factor(tp$topic))
tp$age <- as.numeric(as.factor(tp$age))
tp$sex <- as.numeric(as.factor(tp$sex))
#A series of test/training partition creation and p is percentage of data into training
trainIndex <- createDataPartition(tp$location, p = .8,
list = FALSE,
times = 1)
tpTrain <- tp[trainIndex,]
tpTest <- tp[-trainIndex,]
n <- nrow(tpTrain)
ne <- nrow(tpTest)
# ------------------------------
tp_stan_train <- list(N = n, age = tpTrain[,c("age")], sex = tpTrain[,c("sex")],
location = tpTrain[,c("location")], topic = tpTrain[,c("topic")], rating = tpTrain[,c("rating")])
fit_tp_train <- stan(file = 'tp_train.stan', data = tp_stan_train, iter = 200, chains = 1, verbose = TRUE)
la_tp_train <- extract(fit_tp_train, permuted = TRUE)
## Apply Logistic Regression using model <- glm( rating ~ location+topic+age+sex)
Error I get is the following
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
error in ‘model328474b5861_tp_train3’ at line 18, column 2
16: simplex [N] beta;
17: simplex [N] gamma;
18: alpha = softmax (age+sex);
^
19: beta = softmax (age+sex) ;
PARSER EXPECTED: <one of the following:
a variable declaration, beginning with type,
(int, real, vector, row_vector, matrix, unit_vector,
simplex, ordered, positive_ordered,
corr_matrix, cov_matrix,
cholesky_corr, cholesky_cov
or '}' to close variable declarations>
Error in stanc(filename, allow_undefined = TRUE) :
failed to parse Stan model 'tp_train3' due to the above error.
If I assign values to alpha in transformed parameters block I get the error that I can’t assign a value to variable declared in another block. Please help.
I am not sure how to set the value of alpha which should be a simplex as the softmax(age + sex) in the Stan code ?