Hi there. So I have a bunch of comments from a forum that all relate to a certain topic. I want to see if there was a noticeable difference in the forum users’ attitude towards that topic after a certain date. To estimate that, I’ve done sentiment analysis on the text content (`sentiment`) & I also have a scaled upvote/downvote score (`score_combined`).

Now, I’m assuming that users’ attitude affects both the sentiment of comments and the upvote/downvote score of comments given their sentiment (so, if attitudes are positive, negative comments are downvoted & positive comments are upvoted). So I have two outcome variables here, one of which affects the other.
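To make that assumption concrete, here’s roughly the generative story I have in mind, sketched as a quick simulation (Python, all the numbers are made up by me just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 656

# 1 = comment posted before the date, 2 = after (made-up 50/50 split)
period = rng.integers(1, 3, n)
# hypothetical attitude per period: slightly negative before, positive after
attitude = np.where(period == 2, 0.4, -0.2)

# attitude shifts the sentiment of what people write...
sentiment = rng.normal(attitude, 0.2)
# ...and voters reward comments whose sentiment matches the attitude
score_combined = rng.normal(attitude * sentiment, 1.0)
```

If a model of this is on the right track, fitting it to data simulated like this should roughly recover the attitude values.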

I’m inexperienced w statistical modelling, so here was my initial attempt to model this:

```
data{
    vector[656] score_combined;
    vector[656] sentiment;
    int is_after_date[656];
}
parameters{
    real a1;
    real a2;
    vector[2] b;
    real c;
    real<lower=0> sigma1;
    real<lower=0> sigma2;
}
model{
    vector[656] mu1;
    vector[656] mu2;
    sigma2 ~ exponential( 1 );
    sigma1 ~ exponential( 1 );
    c ~ normal( 0 , 100 );
    b ~ normal( 0 , 10 );
    a2 ~ normal( 0 , 100 );
    a1 ~ normal( 0 , 100 );
    for ( i in 1:656 ) {
        mu2[i] = a2 + b[is_after_date[i]] * c;
    }
    sentiment ~ normal( mu2 , sigma2 );
    for ( i in 1:656 ) {
        mu1[i] = a1 + b[is_after_date[i]] * sentiment[i];
    }
    score_combined ~ normal( mu1 , sigma1 );
}
```

So my thinking here is that `b` kind of maps users’ attitude towards the topic, & that parameter influences both comment sentiment & the upvote/downvote score of comments based on their sentiment. Is this at all the right way of thinking about this sort of problem? Am I in the right neighbourhood? Would of course appreciate any help.


I think I made some progress & have a model that makes sense to me & gives sensible results. Here it is:

```
data{
    vector[749] score_combined;
    int is_after_date[749];
    vector[749] sentiment;
}
parameters{
    real a;
    vector[6] mu2;
    real<lower=0> sigma1;
    real<lower=0> sigma2;
}
model{
    vector[749] mu1;
    sigma2 ~ exponential( 1 );
    sigma1 ~ exponential( 1 );
    mu2 ~ normal( 0 , 10 );
    a ~ normal( 0 , 100 );
    sentiment ~ normal( mu2[is_after_date] , sigma2 );
    for ( i in 1:749 ) {
        mu1[i] = a + mu2[is_after_date[i]] * sentiment[i];
    }
    score_combined ~ normal( mu1 , sigma1 );
}
generated quantities{
    vector[749] log_lik;
    vector[749] mu1;
    for ( i in 1:749 ) {
        mu1[i] = a + mu2[is_after_date[i]] * sentiment[i];
    }
    for ( i in 1:749 ) log_lik[i] = normal_lpdf( score_combined[i] | mu1[i] , sigma1 );
}
```

Here are the results I’m getting:

```
          mean   sd  5.5% 94.5% n_eff Rhat4
a         0.06 0.04  0.00  0.11  3012     1
mu2[1]   -0.01 0.01 -0.03  0.01  3397     1
mu2[2]   -0.03 0.01 -0.04 -0.02  3394     1
sigma1    1.03 0.03  0.99  1.07  2986     1
sigma2    0.19 0.00  0.19  0.20  2575     1
```

As I mentioned I’m not that well versed in this stuff, though, so it could still very well be that my model is bad somehow. If that’s the case, do please let me know. I guess I also need to figure out how to actually interpret `mu2[1]` and `mu2[2]` & how to compare them … wouldn’t mind pointers about that, either!
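The only idea I have so far is to look at the difference of the posterior draws directly - sketched here in Python, with fake draws standing in for my real samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# fake posterior draws standing in for the real samples of mu2[1] and mu2[2]
mu2_1 = rng.normal(-0.01, 0.01, 4000)
mu2_2 = rng.normal(-0.03, 0.01, 4000)

# posterior of the contrast: attitude after minus attitude before
diff = mu2_2 - mu2_1
prob_negative = (diff < 0).mean()            # share of draws where attitude dropped
lo, hi = np.percentile(diff, [5.5, 94.5])    # 89% compatibility interval
```

No idea if that’s kosher, though.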


Hi and sorry for not getting to you earlier.

At first glance the idea of the model does not seem completely unreasonable, but you obviously understand your data better. Some comments:

```
vector[6] mu2;
```

If I get it right, you only ever use `mu2[1]` and `mu2[2]`, right? Why do you put 6 elements in `mu2`?

```
sentiment ~ normal( mu2[is_after_date] , sigma2 );
for ( i in 1:749 ) {
    mu1[i] = a + mu2[is_after_date[i]] * sentiment[i];
}
```

Here, you are using `mu2` twice to inform `mu1` - effectively this is the same as having `sentiment_raw ~ normal(0, sigma2)` and `mu1[i] = a + mu2[is_after_date[i]] * (mu2[is_after_date[i]] + sentiment_raw[i])`. This feels weird (but maybe you have a reason for it?). A more “classical” way to handle this would be to have a separate parameter for the influence of `sentiment` on `mu1`, e.g. `mu1[i] = a + beta2[is_after_date[i]] * sentiment[i]`. Also, your model doesn’t allow for time trends irrespective of sentiment, or for sentiment to affect score irrespective of time (in stats parlance you only fit an interaction, not the main effects), which makes your estimates harder to interpret.
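That equivalence is easy to check numerically - a quick sketch (Python, with arbitrary numbers of my own choosing, just verifying the identity draw by draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 749
a = 0.06
mu2 = np.array([-0.01, -0.03])        # arbitrary stand-in values
idx = rng.integers(0, 2, n)           # 0-based stand-in for is_after_date
sentiment_raw = rng.normal(0.0, 0.19, n)  # the centered noise

# your formulation: sentiment ~ normal(mu2[idx], sigma2)
sentiment = mu2[idx] + sentiment_raw
mu1_yours = a + mu2[idx] * sentiment

# rewritten formulation: quadratic in mu2 plus interaction with the raw noise
mu1_rewritten = a + mu2[idx] * (mu2[idx] + sentiment_raw)

assert np.allclose(mu1_yours, mu1_rewritten)
```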

Does that make at least some sense?

Thanks for your reply @martinmodrak, & no worries, I think I managed to end up w something that kind of maybe works (?) on my own.

Sorry, that wasn’t clear from my description. I’m actually looking at the differences for 3 different topics, each before & after the date. So that means I have 6 different subpopulations. But I wanted to keep it simple so I tried to bracket that. Seems it just confused things, though. Irony!

So, the badly named `mu2` is the thing I want to predict, the attitude. I figured I needed it in there twice, bc it affects both a comment’s sentiment and, depending on that sentiment, the comment’s upvote/downvote score. But I was not sure how to do that & may well have done something weird.

Here’s how I modeled it at first using rethinking’s `ulam`:

```
vote_score ~ dnorm(mu1, sigma1),
mu1 <- a + mu2[network] * sentiment,
sentiment ~ dnorm(mu2[network], sigma2),

# Priors
a ~ dnorm(0, 10),
mu2[network] ~ dnorm(0, 1.5),
sigma1 ~ dexp(1),
sigma2 ~ dexp(1)
```

However, then I got some warnings from Stan & followed an example in the book to reparametrize it:

```
vote_score ~ dnorm(mu1, sigma1),
mu1 <- a + (mu2_bar + z[network] * sigma3) * sentiment,
sentiment ~ dnorm(mu2_bar + z[network] * sigma3, sigma2),

# Priors
a ~ dnorm(0, 10),
z[network] ~ dnorm(0, 1),
mu2_bar ~ dnorm(0, 1.5),
sigma1 ~ dexp(1),
sigma2 ~ dexp(1),
sigma3 ~ dexp(1),

gq> vector[network]:mu2 <<- mu2_bar + z * sigma3
```

Those two should be the same, but maybe the first one is easier to read, not sure. Anyway, I do see how it’s weird, though.

In this version, isn’t `mu1` (the score) affected by `mu2` (the attitude) only via the sentiment? So that, if users had a strongly negative attitude, and there was an outlier comment that was very positive, then the model would only have the sentiment of the comment (positive) to go on when predicting the upvote/downvote score. So it would predict a positive score, even though users had a very negative attitude. & vice versa. Or am I mistaken in my thinking here?

Appreciate the help!

That’s honestly a bit tricky. Obviously you know your domain and data better than me, so it is your call. The method you use is definitely a bit non-standard, but that’s not bad in itself (many standard approaches are terrible :-). Since `sentiment` and `mu2` map 1-to-1 (if I get it correctly), the sequence:

```
sentiment ~ normal( mu2[is_after_date] , sigma2 );
for ( i in 1:749 ) {
    mu1[i] = a + mu2[is_after_date[i]] * sentiment[i];
}
```

is (based on properties of the normal distribution) IMHO equivalent to:

```
sentiment ~ normal( 0 , sigma2 );
for ( i in 1:749 ) {
    mu1[i] = a + mu2[is_after_date[i]] * (mu2[is_after_date[i]] + sentiment[i]);
}
```

so you are getting a quadratic dependence on `mu2` with a fixed coefficient of 1, which I am not sure you want. If you wanted a quadratic effect, it might make more sense to let the magnitude of the term be a free parameter - and the same for all the other terms - i.e. you’d have:

```
mu1[i] = a + b1 * mu2[is_after_date[i]]^2 + b2 * mu2[is_after_date[i]] * sentiment[i] + b3 * sentiment[i] + b4 * mu2[is_after_date[i]];
```

Note that I allow the `sentiment` to alter `mu1` both separately and in interaction with `mu2`.

The above would get you close to standard linear regression analysis. Once again - this is not the “one true way” to do this, just a more standard one.
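For illustration, here is what that more standard setup looks like as a plain least-squares fit on simulated data (Python; the data and the “true” coefficients are completely made up by me):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 749
after = rng.integers(0, 2, n)                         # 0 = before, 1 = after
sentiment = rng.normal(np.where(after == 1, -0.3, 0.2), 0.5)
# simulate scores with main effects of time and sentiment plus their interaction
score = rng.normal(0.05 + 0.1 * after + 0.3 * sentiment
                   - 0.4 * after * sentiment, 1.0)

# design matrix: intercept, time main effect, sentiment main effect, interaction
X = np.column_stack([np.ones(n), after, sentiment, after * sentiment])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
# beta now holds estimates of (intercept, time effect, sentiment effect, interaction)
```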

Does that make sense?
