Hello,
I am currently trying to apply an ordinal model to my outcome variable: sentiment score (positive, negative, neutral).
I want to examine whether there is a relationship between the way a question is asked (positive, negative, or neutral wording) and the sentiment of the response. I asked 2638 people a question about symptoms: one third were asked with negative wording, one third with neutral wording, and one third with positive wording. I then ran sentiment analysis on the responses (using Tyler Rinker's sentimentr package) to see whether they were more positive or negative depending on the wording of the question.
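For reference, the sentence-level scoring step looks roughly like this (assuming the package is Tyler Rinker's sentimentr; the responses here are toy stand-ins for my real data):

```r
library(sentimentr)

# Toy responses standing in for the real survey answers
responses <- c("I feel fine today. No symptoms at all.",
               "The pain was terrible and I could not sleep.")

# sentiment() scores each sentence separately, which is why the
# number of rows (sentences) ends up larger than the number of people
scores <- sentiment(get_sentences(responses))
scores  # columns: element_id, sentence_id, word_count, sentiment
```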
Sentiment analysis breaks responses down into sentences, so although I have 2638 people, I have 7924 sentences; I therefore plan to fit ID as a random effect.
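In brms notation, the random intercept I have in mind would look something like this (a sketch only; the family and cs() term are the ones from the models further down, and `data` is the list of imputed datasets):

```r
library(brms)

# Sketch: sentence-level rows are nested within respondents, so a
# random intercept per ID accounts for the repeated measures
fit <- brm_multiple(
  sentiment ~ 1 + cs(group) + (1 | ID),
  data = data,               # list of imputed data frames
  family = acat("cloglog"),
  chains = 4
)
```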
The big question is: does the way the question is asked (the prime type, the group variable in my data) affect the sentiment of the response?
Here is a subset of my data:
structure(list(agequartiles = structure(c(1L, 3L, 2L, 1L, 2L,
4L, 3L, 1L, 3L, 4L, 1L, 2L, 2L, 2L, 4L, 1L, 3L, 3L, 4L, 4L, 4L,
3L, 4L, 1L, 4L, 3L, 1L, 4L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 3L, 2L,
2L, 3L, 4L, 4L, 3L, 2L, 3L, NA, 1L, 1L, 1L, 2L, 2L), .Label = c("[18,23]",
"(23,27]", "(27,32]", "(32,54]"), class = "factor"), sentiment = c(1,
1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1,
1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 2, 1,
1, 2, 1, 1, 3, 1, 3), group = structure(c(2L, 3L, 3L, 2L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 2L, 1L, 3L,
1L, 3L, 2L, 1L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 2L, 3L,
3L, 3L, 3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 2L), .Label = c("prime1",
"prime2", "prime3"), class = "factor"), continent = c("UK", "Australia and New Zealand",
"Northern America", "UK", "Northern America", "Australia and New Zealand",
"Asia and the Pacific", "UK", "Southern and Central America",
"Australia and New Zealand", "UK", "Northern America", "Northern America",
"UK", "Northern America", "UK", "UK", "Northern America", "UK",
"Northern America", "Northern America", "Southern and Central America",
"Northern America", "UK", "Europe", "Northern America", "UK",
"Northern America", NA, "UK", "UK", "Australia and New Zealand",
"Australia and New Zealand", "UK", "UK", "UK", "Australia and New Zealand",
"Northern America", "UK", "Northern America", "UK", "Asia and the Pacific",
"Northern America", "Northern America", NA, NA, "UK", "Europe",
"UK", "Northern America"), ID = 1:50, medication = c("FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE",
"FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE",
"FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE")), row.names = c(NA,
50L), class = "data.frame")
This is my workflow:
- Chose a link function using this tutorial; the link chosen was cloglog.
- Chose a model using this tutorial and settled on acat (adjacent-category) with category-specific effects.
- Imputed missing data.
- Ran models with different predictors.
- Chose a model.
I imputed missing data using the missRanger package (mice wouldn't work):
library(missRanger)

# Five imputations with different seeds; the result is a list of
# completed data frames, which brm_multiple() accepts directly
data <- lapply(3456:3460, function(x)
  missRanger(
    data,
    formula = . ~ . - ID,  # impute all columns, using all columns except ID as predictors
    maxiter = 10,          # maximum number of chaining iterations
    pmm.k = 3,             # predictive mean matching for more natural imputed values
    verbose = 1,           # how much progress information is printed
    seed = x,              # seed to initialize the random generator
    num.trees = 200,
    returnOOB = TRUE
  )
)
Models I ran:
models_group <- brm_multiple(formula = sentiment ~ 1 + cs(group), data = data, family = acat("cloglog"), combine = TRUE, chains = 4)
models_meds <- brm_multiple(formula = sentiment ~ 1 + cs(group) + medication, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)
models_age <- brm_multiple(formula = sentiment ~ 1 + cs(group) + age, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)
models_continent <- brm_multiple(formula = sentiment ~ 1 + cs(group) + continent, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)
models_all <- brm_multiple(formula = sentiment ~ 1 + cs(group) + age + medication + continent, data = data, family = acat("cloglog"), combine = TRUE, chains = 4)
Then I tried to use LOO to see which model worked best.
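The comparison I attempted looks roughly like this (a sketch assuming the fits above have completed; add_criterion() stores the LOO result on each fit):

```r
# Compute and store approximate leave-one-out cross-validation per model
models_group <- add_criterion(models_group, "loo")
models_meds  <- add_criterion(models_meds, "loo")
models_age   <- add_criterion(models_age, "loo")

# The model in the first row (elpd difference of 0) has the best
# estimated expected predictive fit
loo_compare(models_group, models_meds, models_age, criterion = "loo")
```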
my questions:
- Is this a reasonable workflow?
- I still don't understand priors. I have read tutorials and posts on here, but I still can't work out what to set mine to.
- In my data, the models that include age don't work: they fail to converge and Rhat is more than 1.1. Should I run more chains?
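On the prior question: I know the defaults brms would use can be listed with get_prior(), and individual rows can then be overridden with set_prior(); a sketch (the normal(0, 2.5) choice is just an illustration, not a recommendation):

```r
library(brms)

# List the default priors for one of the models; data is a list of
# imputed data frames, so inspect the first one
get_prior(sentiment ~ 1 + cs(group), data = data[[1]],
          family = acat("cloglog"))

# Example override: a weakly informative prior on all population-level
# coefficients (class "b"); passed to brm_multiple() via prior =
priors <- set_prior("normal(0, 2.5)", class = "b")
```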
Thank you for reading.