# Incorporating contextual knowledge about semi-supervised experiment

I am currently designing what is often called a “transmission-chain” experiment.
In this design, an initial group of people reads a text. They are then asked to transmit the information in the text to another, unknown individual.
The next wave of participants then reads these generated texts and is asked to transmit the information to yet another individual.
The entire process is repeated 4 times. An entire sequence of information transmissions is referred to as a “chain”.

Of course, the information quickly decays across the chain. To test hypotheses, researchers manipulate the initial texts and then observe the different rates of information decay in the chains.
For the most part, this is done through multiple t-tests.

I am wondering whether incorporating more information about the design could improve the precision of the analysis, since it is very expensive to generate good data for these experiments.

One important feature is that, by definition, the amount of information cannot increase from one position in the chain to the next; it can only stay the same or decrease.

I generated some fake data that follows a negative binomial process where a count of transmitted information has to decrease.
The R code is not great, but it creates reasonable data.

```r
library(tidyverse)

mu.t <- c(3, 2, 1.5, 1) # Means of treatment chains
mu.c <- c(4, 3, 2.5, 2) # Means of control chains

# Simulate one chain: redraw each position until it does not exceed the previous one
chain.sim <- function(vector, N) {
  all.list <- list()
  all.list[[1]] <- MASS::rnegbin(n = N, mu = vector[1], theta = 1000)

  for (i in 2:length(vector)) {
    is_smaller <- FALSE
    while (!is_smaller) {
      Y <- MASS::rnegbin(n = N, mu = vector[i], theta = 1000)
      if (all(all.list[[i - 1]] >= Y)) is_smaller <- TRUE
    }
    all.list[[i]] <- Y
  }
  all.list
}

# 50 chains per condition; each column is one chain, each row one position
sims.t <- replicate(50, chain.sim(mu.t, 1))
sims.c <- replicate(50, chain.sim(mu.c, 1))

sims.t <- data.frame(sims.t) %>%
  rowid_to_column("position") %>%
  mutate(treatment = "Y")
sims.c <- data.frame(sims.c) %>%
  rowid_to_column("position") %>%
  mutate(treatment = "N")

# Long format: one row per chain and position
sims <- bind_rows(sims.t, sims.c) %>%
  pivot_longer(!position & !treatment, names_to = "name", values_to = "outcome") %>%
  unite("chain", name:treatment, remove = FALSE) %>%
  dplyr::select(-name)

sims$outcome <- as.numeric(sims$outcome)
sims$position <- as.factor(sims$position)

ggplot(sims, aes(x = position, y = outcome, fill = treatment)) + geom_boxplot()
```

A first simple model I thought was reasonable is a negative binomial GLM that regresses the outcome on the interaction of treatment and position in the chain (eventually turning it into a hierarchical model).

But my main question is: Is there a possibility to model the inevitable decrease of the outcome variable inside each individual chain of responses?
I am not sure how to do this through priors.
Thank you very much for all help!

Just to summarize my understanding of the question:

• Your response is a count measure that you might think to model using a negative binomial process.
• The response decreases across timesteps.
• It’s not enough to ensure that the expectation of the response decreases through time; you need the actual response to decrease through time, no exceptions.

One approach to this sort of problem would be, instead of modeling the response at each timestep as conditionally independent (given its expectation), to model the differences between timesteps from a distribution that is strictly nonnegative. For example, maybe you could model the response after each transmission as binomially distributed, with a number of trials equal to the response at the previous iteration.

Then you can come up with whatever model might be appropriate for the binomial proportion `p`, including overdispersion terms if necessary, etc.
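As a rough illustration of the binomial idea, here is a minimal R sketch that simulates chains by binomial thinning. The function name `simulate_chain`, the starting count `n_items`, and the retention probabilities `p_retain` are all made up for the example; they are not taken from the experiment.

```r
# Sketch: simulate chains by binomial thinning, so counts can never increase.
# `simulate_chain`, `n_items`, and `p_retain` are hypothetical names/values.
set.seed(1)

simulate_chain <- function(n_items, p_retain) {
  # p_retain[k] is the assumed probability that a single item survives
  # transmission step k.
  outcome <- integer(length(p_retain))
  trials <- n_items
  for (k in seq_along(p_retain)) {
    outcome[k] <- rbinom(1, size = trials, prob = p_retain[k])
    trials <- outcome[k] # the next step can retain at most what arrived
  }
  outcome
}

# 50 chains with 4 transmissions each; rows are chains, columns are positions
chains <- t(replicate(50, simulate_chain(n_items = 10,
                                         p_retain = c(0.6, 0.7, 0.8, 0.8))))

# Every chain is non-increasing by construction
stopifnot(all(apply(chains, 1, diff) <= 0))
colMeans(chains)
```

Because the number of trials at each step is the count from the previous step, the monotone decrease is built into the likelihood rather than imposed through priors.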

Edit: this framework also opens the door to even more sophisticated dependency structures. For example, you could specifically track each relevant piece of information (i.e. the response) if you think that the probabilities of different pieces of information being lost aren’t equal.
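A small R sketch of that item-level idea, again with made-up numbers: each piece of information gets its own per-transmission retention probability (`p_item`, a hypothetical vector), and an item that is lost once stays lost.

```r
# Sketch: track each piece of information separately.
# `p_item` holds assumed (made-up) per-transmission retention probabilities.
set.seed(2)

simulate_items <- function(p_item, n_steps) {
  # Rows are items, columns are chain positions; TRUE means still present.
  present <- matrix(FALSE, nrow = length(p_item), ncol = n_steps)
  alive <- rep(TRUE, length(p_item))
  for (k in seq_len(n_steps)) {
    # An item can only survive if it was still present before this step.
    alive <- alive & (runif(length(p_item)) < p_item)
    present[, k] <- alive
  }
  present
}

p_item <- c(0.95, 0.7, 0.7, 0.5, 0.5) # the first item plays the "main point"
present <- simulate_items(p_item, n_steps = 4)

# Count outcome per position; non-increasing by construction
colSums(present)
```

The count outcome is then just the column sum of the presence matrix, so the aggregate model and the item-level model describe the same data at different resolutions.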


Thanks! You are absolutely correct in your understanding.
I will look into your suggestion about the varying number of trials. Turns out, I was really overthinking my problem.

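```stan
// Uses the array syntax of Stan 2.26+; older releases write `int outcome[n];` instead.
data {
  int<lower=0> n;
  int<lower=1> n_chain;
  array[n] int<lower=0> outcome;
  array[n] int<lower=0> diftrial; // maximum number of possible items (previous successes in the chain)
  array[n] int<lower=1, upper=2> treatment;
  array[n] int<lower=1, upper=4> position;
  array[n] int<lower=1, upper=n_chain> chain;
}
parameters {
  array[2] real a;       // treatment effects
  array[n_chain] real b; // chain effects
  array[4] real c;       // position effects
}
transformed parameters {
  // per-observation retention probability
  vector<lower=0, upper=1>[n] p;
  for (i in 1:n) {
    p[i] = inv_logit(a[treatment[i]] + c[position[i]] + b[chain[i]]);
  }
}
model {
  a ~ normal(0, 1.5);
  b ~ normal(0, 1.5);
  c ~ normal(0, 1.5);
  outcome ~ binomial(diftrial, p);
}
generated quantities {
  // posterior predictive draws
  array[n] int y_rep = binomial_rng(diftrial, p);
}
```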
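For reference, one way the `diftrial` input could be derived from long-format data is sketched below in base R. The toy data frame and its column names are assumptions for illustration, and so is the handling of the first position, which is simply dropped here because it has no predecessor. If the number of items in the original text is known, it could serve as the number of trials for position 1 instead.

```r
# Sketch: derive `diftrial` (previous count in the same chain) from long data.
# The toy data frame and its column names are assumptions for illustration.
long <- data.frame(
  chain     = rep(c("c1", "c2"), each = 4),
  position  = rep(1:4, times = 2),
  treatment = rep(c(1L, 2L), each = 4),
  outcome   = c(5, 3, 3, 1, 4, 2, 0, 0)
)

long <- long[order(long$chain, long$position), ]

# Previous outcome within each chain; NA at the first position
long$diftrial <- ave(long$outcome, long$chain,
                     FUN = function(x) c(NA, head(x, -1)))

# The first position has no predecessor, so it is dropped (an assumption)
d <- long[!is.na(long$diftrial), ]

stan_data <- list(
  n         = nrow(d),
  n_chain   = length(unique(d$chain)),
  outcome   = as.integer(d$outcome),
  diftrial  = as.integer(d$diftrial),
  treatment = d$treatment,
  position  = d$position,
  chain     = as.integer(factor(d$chain))
)

# Rows with zero trials force outcome = 0 and carry no information about p
sum(stan_data$diftrial == 0)
```

Counting the zero-trial rows this way gives a quick sense of how much of the data is structurally zero before looking at the posterior predictive draws.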

Yet when trying to predict the simulated data, the model is not doing great.

Is that behaviour caused by the cases where the number of trials is 0?

You’re definitely right that the model is currently overpredicting the number of zeros. I have no domain expertise, but some possible reasons include:

• Might it be fundamentally rare to lose the final piece of information, regardless of what it is? If I tell you just one single thing, perhaps it is easy enough for you to retain and repeat it. If so, you might want to let the number of trials itself appear as a covariate on the binomial probability.
• Might there be one piece of information that is rarely lost (like the main point of the original message)? If so, then you might want to track each piece of information individually, with its own per-transmission probability of getting lost.
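
To make the first suggestion concrete, here is a hedged R sketch using a plain binomial GLM on toy data. The indicator `last_item` (1 when only one item is left to transmit) is just one hypothetical way of letting the number of trials enter the linear predictor, and the simulated effect sizes are invented. In the Stan model, the analogue would be one additional term inside `inv_logit(...)`.

```r
# Sketch: let the number of trials act as a covariate on the binomial probability.
# All data are simulated; `last_item` and the effect sizes are assumptions.
set.seed(3)

n <- 400
d <- data.frame(
  diftrial  = sample(1:8, n, replace = TRUE), # zero-trial rows are excluded
  treatment = factor(sample(c("N", "Y"), n, replace = TRUE))
)
d$last_item <- as.integer(d$diftrial == 1)

# Assumed truth: a single remaining item is retained more easily
eta <- 0.3 - 0.5 * (d$treatment == "Y") + 1.5 * d$last_item
d$outcome <- rbinom(n, size = d$diftrial, prob = plogis(eta))

# Successes and failures out of `diftrial` trials
fit <- glm(cbind(outcome, diftrial - outcome) ~ treatment + last_item,
           family = binomial, data = d)
coef(fit)
```

A smooth function of the number of trials, such as `log(diftrial)`, would be an alternative to the indicator if the effect is expected to be gradual.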