# Outcome contest model - conflict between predictions and parameter estimates

TL;DR: when fitting an outcome contest model to experimental data, the parameter estimation indicates a negative effect of the main predictor, but predicted outcomes indicate a positive effect.

I am working on a cultural transmission experiment, where stimuli are produced within chains (participant p1 looks at stimulus 0 and reproduces it, the output of that gets given to participant p2, who reproduces it, etc.). We hypothesize that stimuli at later generations are “refined” to be more perceptually salient to humans (see here for a prev paper on this: https://www.pnas.org/doi/10.1073/pnas.1910880117).

We therefore build a 2 alternatives forced choice task, where all pairs of stimuli are presented to new participants and they have to indicate which is the more salient (or other properties, it doesn’t matter here).

The process can be simulated as follows:

pacman::p_load(
tidyverse,
brms,
patchwork)

## Define amount of data
items <- 30
chains <- 5
raters <- 30

## Define parameter values
GenerationB <- .4
sigma_raters <- .4
sigma_items <- .4
sigma_chains <- .2

# Create dataset with stimuli, with generation and chain ids
d_items <- tibble(
item = 1:items,
item_intercept = rnorm(items, 0, sigma_items),
generation = rep(-1:1, 10),
chain = rep(seq(chains), each = items/chains),
chain_slope = NA
)

chain_slope = rnorm(chains, 0, sigma_chains)

for (i in seq(items)) {
d_items$chain_slope[i] <- chain_slope[d_items$chain[i]]
}

# Create dataset with raters
d_raters <- tibble(
ID = 1:raters,
rater_slope = rnorm(raters, 0, sigma_raters)
)

## Now create all possible combinations between the stimuli
combinations <- t(combn(1:items, 2))

df_recovery <- NULL
for (rater in seq(raters)) {
print(rater)
temp <- tibble(
stimulusR = combinations[,1],
stimulusL = combinations[,2],
chainR = NA,
chainL = NA,
generationR = NA,
generationL = NA,
intR = NA,
intL = NA,
diff = NA,
thetaR = NA,
choiceR = NA
)

for (i in seq(nrow(combinations))) {

temp$chainR[i] <- d_items$chain[temp$stimulusR[i]] temp$chainL[i] <- d_items$chain[temp$stimulusL[i]]
temp$generationR[i] <- d_items$generation[temp$stimulusR[i]] temp$generationL[i] <- d_items$generation[temp$stimulusL[i]]

temp$intR[i] <- d_items$item_intercept[temp$stimulusR[i]] + (GenerationB + d_items$chain_slope[temp$stimulusR[i]] + d_raters$rater_slope[rater]) * d_items$generation[temp$stimulusR[i]]

temp$intL[i] <- d_items$item_intercept[temp$stimulusL[i]] + (GenerationB + d_items$chain_slope[temp$stimulusL[i]] + d_raters$rater_slope[rater]) * d_items$generation[temp$stimulusL[i]]

temp$diff[i] = temp$intR[i] - temp$intL[i] temp$thetaR[i] = inv_logit_scaled(temp$diff[i]) temp$choiceR[i] = rbinom(1, 1, temp$thetaR[i]) temp$rater[i] = rater
}

if (exists("df_recovery")) {df_recovery <- rbind(df_recovery, temp)} else {df_recovery <- temp}

}

# Visualize effects of generation across all data
p1 <- df_recovery %>%
group_by(generationL, generationR, stimulusR) %>%
summarize(x = mean(choiceR)) %>%
mutate(generationR = as.numeric(generationR), generationL = as.numeric(generationL)) %>%
ggplot(aes(generationR, x)) +
geom_point() + geom_smooth(aes(generationR, x)) +
facet_wrap(.~ generationL) + theme_bw() + theme(legend.position = "none")

# Visualize effects of generation by participant
p2 <- df_recovery %>%
group_by(generationL, generationR, stimulusR, rater) %>%
summarize(x = mean(choiceR)) %>%
mutate(generationR = as.numeric(generationR), generationL = as.numeric(generationL)) %>%
ggplot(aes(generationR, x, group = rater, color = rater)) +
geom_point() + geom_smooth(aes(generationR, x), method = lm, se = F) +
facet_wrap(.~ generationL) + theme_bw() + theme(legend.position = "none")

# Visualize effects of generation by chain
p3 <- df_recovery %>%
group_by(generationL, generationR, stimulusR, chainR) %>%
summarize(x = mean(choiceR)) %>%
mutate(generationR = as.numeric(generationR), generationL = as.numeric(generationL)) %>%
ggplot(aes(generationR, x, group = chainR, color = chainR)) +
geom_point() + geom_smooth(aes(generationR, x), method = lm, se = F) +
facet_wrap(.~ generationL) + theme_bw() + theme(legend.position = "none")

# combine the three plots
p1/p2/p3

# prepping for Stan:
d1_sim <- list(
N_choices = nrow(df_recovery),
N_items = items,
N_chains = chains,
N_raters = raters,
y = as.numeric(df_recovery$choiceR), item1 = as.numeric(df_recovery$stimulusR),
item2 = as.numeric(df_recovery$stimulusL), rater = as.numeric(df_recovery$rater),
generation = d_items$generation, chain = d_items$chain
)


On the x-axis the generation for the right stimulus. On the y-axis the propensity to choose the right stimulus. The plots are split by generation of left stimulus (panels / columns). The first row is the overall pattern in the data. The second splitting the effect of generation by participant. The third by chain (of the right stimulus).

I then build an outcome contest model

// contest model

functions{
real normal_lb_rng(real mu, real sigma, real lb) {
real p = normal_cdf(lb | mu, sigma);  // cdf for bounds
real u = uniform_rng(p, 1);
return (sigma * inv_Phi(u)) + mu;  // inverse cdf for value
}
}

data{
int N_choices;  // amount of data points
int N_items;  // unique stimuli
int N_raters;  // unique raters
int N_chains;  // unique chains
array[N_choices] int y;  // outcome
array[N_choices] int item1;  // which stimulus 1 in each trials
array[N_choices] int item2;  // which stimulus 2 in each trials
array[N_choices] int rater;  // which rater in each trials
array[N_items] int chain; // item features: chain
array[N_items] int generation;      // item feature: generation
}

parameters{
real GenerationB;

vector[N_items] StimulusIntercept;             // random intercept of item
real<lower=0> sigma_items;                   // to standardize stimulus lvl deviations

vector[N_raters] GenerationSubject;             // random slope of generation over raters
real<lower=0> sigma_raters;                   // to standardize participant lvl deviations

vector[N_chains] GenerationChain;             // random slope of generation over raters
real<lower=0> sigma_chains;                   // to standardize chain lvl deviations
}

model{
vector[N_choices] logit_p;
real wx; // partial effect of generation
vector[N_choices] a1; // intentionality score of item 1
vector[N_choices] a2; // intentionality score of item 2

// Priors
target += normal_lpdf(StimulusIntercept | 0, 1); // Intercept by Stimulus
target += normal_lpdf(sigma_items | 0, 1) -
normal_lccdf(0 | 0, 1); // Variability of stimulus variability

target += normal_lpdf(GenerationB | 0, 1); // Effect of Generation
target += normal_lpdf(GenerationSubject | 0, 1); // Individual variations in effects of generation
target += normal_lpdf(GenerationChain | 0, 1); // Individual variations in effects of generation
target += normal_lpdf(sigma_raters  | 0, .5) -
normal_lccdf(0 | 0, .5); // Variability of rater variability
target += normal_lpdf(sigma_chains | 0, .5) -
normal_lccdf(0 | 0, .5); // Variability of rater variability

for ( i in 1:N_choices ) {

wx = GenerationB + GenerationSubject[rater[i]] * sigma_raters;

a1[i] = StimulusIntercept[item1[i]] * sigma_items +
(wx + GenerationChain[chain[item1[i]]] * sigma_chains) * generation[item1[i]]; // estimate probability of stimulus on the right to be chosen (in absolute)

a2[i] = StimulusIntercept[item2[i]] * sigma_items +
(wx + GenerationChain[chain[item2[i]]] * sigma_chains) * generation[item2[i]];  // estimate probability of stimulus on the left to be chosen (in absolute)

logit_p[i] = a1[i] - a2[i];  // calculate probability for stimulus on the right to be chosen

target += bernoulli_logit_lpmf(y[i] | logit_p[i]);

}

}

generated quantities {
real StimulusIntercept_prior;
real sigma_items_prior;

real GenerationB_prior;
real GenerationSubject_prior;
real GenerationChain_prior;
real sigma_raters_prior;
real sigma_chains_prior;
//real sigma_chains_int_prior;

vector[N_choices] log_lik;
vector[N_choices] logit_p;
real wx;
vector[N_choices] a1;
vector[N_choices] a2;

StimulusIntercept_prior = normal_rng(0, 1);
GenerationB_prior = normal_rng(0, 1);
GenerationSubject_prior = normal_rng(0, 1);
GenerationChain_prior = normal_rng(0, 1);

sigma_items_prior = normal_lb_rng(0, 1, 0);
sigma_raters_prior = normal_lb_rng(0, .5, 0);
sigma_chains_prior = normal_lb_rng(0, .5, 0);
//sigma_chains_int_prior = normal_lb_rng(0, 1, 0);

for (i in 1:N_choices) {

wx = GenerationB + GenerationSubject[rater[i]] * sigma_raters;

a1[i] = StimulusIntercept[item1[i]] * sigma_items + (wx + GenerationChain[chain[item1[i]]] * sigma_chains) * generation[item1[i]]; // estimate probability of stimulus on the right to be chosen (in absolute)
a2[i] = StimulusIntercept[item2[i]] * sigma_items + (wx + GenerationChain[chain[item2[i]]] * sigma_chains) * generation[item2[i]];  // estimate probability of stimulus on the left to be chosen (in absolute)
logit_p[i] = a1[i] - a2[i];  // calculate probability for stimulus on the right to be chosen
log_lik[i] = bernoulli_logit_lpmf(y[i] | logit_p[i]);

}

}


The model fits the data nicely and recovers the right parameter values.
I then fit it to our experimental data (Dropbox - exp2_prepped_data.csv - Simplify your life).
Model quality checks all work (no divergences, rhat and other diagnostics are in place, chains are fine, prior and posteriors look great):

The posterior estimates for our predicted effect are the opposite of what we expect (later generations are less salient, negative estimate of GenerationB), but that’d be life. However, when I produce model predictions, the model predictions indicate that with later generation comes more salience.

(upper panel, lines indicate different raters, lower panel, lines indicate different chains).

Further, if I then simulate data with the mean estimates from the model, the simulated data don’t look like the predicted data.

GenerationB <- -0.1 #.4
sigma_raters <- 0.27 #.4
sigma_items <- 1 #.4
sigma_chains <- 0.13 #.2

# then code like in the first chunk above


Nevertheless, if I check whether the predicted data (from the model) and the actual data match, they do and very closely.

The plot simply maps actual against predicted data

the plots overlay the distributions of predicted ®d) and actual data (blue).

So I am very puzzled as to what I am doing wrong. I also checked whether the presentation of the stimuli was not balanced, but that was not the case. Any further suggestion?

2 Likes

There’s a lot going on there and I haven’t spent enough time going through all of it, but since there are no replies this far I’ll weigh in. I apologize in advance for missing something or any nonsense.

First, this called my attention:

When you say there’s a discrepancy are you comparing mean parameter values and mean model output?

Since the mean parameter values are generally not a sample from the chain and don’t necessarily constitute a parameter set with high likelihood, the output from those values may not fit the data at all and simulating from those values is likely to produce output very different from it.

If that’s the case, the simple explanation would be that the expected value for the model output is not a function of the expected value of the model parameters – in the former expectation the output is computed from all/many sets of parameters sampled and only then is the expected value taken, while in the latter the expected value is computed for parameters (and while you can compute a model prediction from it, I’d say normally you wouldn’t).

I’m not sure I understand, as these sound contradictory, but if again you are comparing simulations from the posterior (e.g. via generated quantities block) and simulated from the mean parameters the issue would be the same as above.

1 Like

Thanks for the answer. And yes, what I have is:

• actual data (experimentally elicited from participants)
• predicted data (outcome predictions generated via the generated quantities blocks, based on the exact stimuli and conditions from the actual data, a subset of all possible combinations)
• simulated data (based on the mean posterior parameter estimates, and a systematic fully counterbalanced set of stimuli/conditions)

Actual data and predicted data match very well.
Simulated data do not match in their general pattern with either actual or predicted data.

You say:

Yes, and that’s my current headache. This could be due: i) to an error in my code (which I can’t rule out, but I have triple-checked); ii) to a complex relation between parameters and model outcome.
Assuming it’s the second, I’m trying to identify where this happens and how to then interpret the model: i) the model parameter estimation suggests a negative effect of generation; ii) the predicted outcomes suggest a positive effect of generation. In order to reconcile these two things, I need to provide a detailed explanation of how this happens.

Thanks for clarifying. Based on that I’d say there is no inconsistency (so far):

The first two are meant to agree, because the predictions sampled from the posterior are the ones found to have high probability. If they didn’t you’d have a problem (coding error or a very poor fit), and if you triple checked the model there’s no obvious reason to think there’s something wrong.

This is also expected in many cases, unless the posterior is highly symmetric and sharp (in which case the expected value would be close to the MAP value and simulation from either would be a good fit to the data), there’s no reason to think they must match in general.

Assuming we can discard errors, I gather this is a linear model (though it still goes through a logit function), so there shouldn’t be much complexity in the prediction given the predictor of interest. I’d have to look much closer to try and be sure, but a couple of thoughts that could reconcile the (hopefully not real, just apparent) contradiction: (i) since the expected value of the parameter is negative it must be because there are larger/more negative values than positive ones, however (ii) their effect on the model output may not be proportional to their magnitude (e.g. larger parameter values may get flattened) so maybe more small positive parameter values will cause prediction to be positive despite a few large negative values.

I am not sure this is common, but I can think of a couple of (maybe not very realistic) examples, for instance a function f(x) = min(|x|,\theta_{max}), where large negative magnitudes would not necessarily produce a negative prediction. Maybe you could try split the parameters into two sets of positive and negative values – the sum of those would straightforwardly give you the mean – but then also compute the model output from each set separately, if the set of negative values is smaller and you find that the predictions from that set alone do not produce strong trends that could be an explanation. Another test you could to is to reflect either the positive or negative set to get a symmetric (made-up) distribution to build intuition on what to expect of the model posterior, maybe better yet, just compute the output of the model for values ranging from the most negative to the most positive parameter values.

Maybe there’s a better way of assessing thing, but since there is nothing that seems really off I don’t see much else than to understand what the model is supposed to produce from the a range of parameter values (I’m also ignoring the effect of other parameters, assuming they are not doing anything crazy, but in general other parameters could add to complexity that generates unintuitive results).

1 Like