# Trying to fit my customised logistic function

Hello Stan Community,

I have data from 114 subjects that I stored in an R matrix (rows are subjects, columns are blocks from 1 to 14). The dependent variable is the performance in each block, ranging from 0 to 1 (continuous). I want to fit the data to a logistic regression that has 3 parameters (steepness, max, inflection). Eventually I want to get an estimate for each parameter for each subject. Some of the subjects are in a control condition that I use as a baseline, so I can define outliers from that function by taking the 95 % credible interval for each parameter from the controls (basically I want to see which subjects don't follow the control function).
I managed to do that in R using nloptr. Now I want to implement my code in Stan.
I am a total beginner with Stan, so it would be awesome if I could get some help.

My data is stored in a matrix:

114 rows, 14 columns
Each row represents a subject and each column represents a block (1:14).
Each column shows the accuracy per block, which is a float ranging from the minimum of all blocks to 1 (depending on the subject, the minimum accuracy is >= 0.5).

I want to fit this function to the data:

Parameters:

inflection (ranging from 1-13, mean = 7)
steepness (ranging from 1-29, mean = 7)
max (ranging from 0-1, mean = 0.9)

(min does not need to be computed, it just depends on the subject)
I want to fit this function:
y = (maxval[n] - min[n]) * (1 / (1 + exp(-steepness[n] * (x[n] - inflection[n])))) + min[n]

Finally I want to get estimates per subject for each parameter.
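To get a feel for the curve before coding it in Stan, here is a quick numerical sketch (Python, purely illustrative; the parameter values are made up): the function rises from min to max and crosses the midpoint exactly at the inflection block.

```python
import math

def logistic_perf(x, minval, maxval, steepness, inflection):
    # three-parameter logistic on top of a fixed floor:
    # rises from minval to maxval, hitting (minval + maxval) / 2 at x = inflection
    return (maxval - minval) / (1.0 + math.exp(-steepness * (x - inflection))) + minval

# hypothetical values: chance floor 0.5, ceiling 0.9, inflection at block 7
curve = [logistic_perf(x, 0.5, 0.9, 2.0, 7.0) for x in range(1, 15)]
```

With these values the curve sits near 0.5 for the first blocks, passes 0.7 exactly at block 7, and flattens out near 0.9, which is the qualitative shape described above.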

I tried the following code:

logSTAN.stan = "data {
  vector x[n];
  vector y[n];
  vector min[n];
}
parameters {
  int inflection;
  int steepness;
  real maxval;
}
model {
  inflection ~ normal(7, 13);
  steepness ~ normal(7, 29);
  max ~ normal(0.9, 1);
  y ~ ((maxval[n]-min[n])*(1/(1+exp(-steepness[n]*(x[n]-inflection[n])))) + min[n]);
}"


I am not sure how to get my data, stored in a matrix, into the proper Stan format to loop over all subjects.

I tried this for 1 subject only:

logModel.stan = "data {
  int x;
  real y;
}
parameters {
  real inflection;
  real steepness;
  real maxval;
}
model {
  inflection ~ normal(7, 13);
  steepness ~ normal(7, 29);
  maxval ~ normal(0.9, 1);
  y ~ (maxval-0.5)*(1/(1+exp(-steepness*(x-inflection)))) + 0.5;
}
"


However, it does not work because I don't define a distribution. My dependent variable ranges from 0 to 1 and is a float; I tried a normal distribution, but that does not work, as Stan does not allow my function as an input. Any ideas?

Thank you so much.

Carina


Hello everyone, I fitted my model for one subject:

sub1data <- list(t = length(sub1), y = sub1)
logModel.stan = "functions {
  real logfun_log(vector t, real maxval, real steepness, real inflection) {
    vector[num_elements(t)] prob;
    real lprob;
    for (i in 1:num_elements(t)) {
      prob[i] <- (maxval-0.5)*(1/(1+exp(-steepness*(t[i]-inflection)))) + 0.5;
    }
    lprob <- sum(log(prob));
    return lprob;
  }
}
data {
  int t;
  vector[t] y;
}
parameters {
  real<lower=7, upper=13> inflection;
  real<lower=7, upper=29> steepness;
  real<lower=0.9, upper=1> maxval;
}
model {
  y ~ logfun(maxval, steepness, inflection);
}
"


It looks pretty good. I am still unsure about my priors and about specifying a distribution for my parameters; any ideas would be very much appreciated. I am currently trying to fit all subjects into one model, and my ideal output would be a matrix where I can see all three parameters for each subject.

Thank you again.

Carina

Hi Carina! Welcome to the Stan forums! :)

Are you able to share some example (or toy) data with us? It'd be easier to have a look into your questions with some data at hand.

Could you give a bit more context about what you are doing? What is performance? What are blocks? (Maybe in 2-3 sentences)

inflection (ranging from 1-13, mean = 7)
steepness (ranging from 1-29, mean = 7)
max (ranging from 0-1, mean = 0.9)

Are these constraints ("ranging from …") binding? Or is it that, for example, inflection is non-zero, with typical values between 1-13 and a mean at around 7?

Also, in your Stan code, what's the data? You switched from vector x[n]; vector y[n]; vector min[n]; to int x; real y; to int t; vector[t] y;. It's a bit hard to tell what's going on here without context.

I would also encourage you to write your Stan program in a separate *.stan file. Using RStudio with Stan files gives you syntax highlighting and a Check button, to (roughly) see whether the code is fine.

Cheers!
Max

Thank you, I will try that using a .stan file.
I have a matrix with rows for subjects (114) and columns for blocks (14 blocks). Performance per block is what I am interested in, and it therefore ranges between 0.5 and 1 (they can be either correct or incorrect on each trial; one block consists of 96 trials). I want to fit a function that is defined through steepness, maximum, minimum and an inflection point. Later on, I want to compare a linear vs. logistic vs. step function.
That's how I stored my data:

          1         2         3         4     5         6         7     8     9        10    11    12

1 0.5714286 0.5000000 0.8750000 0.5000000 0.500 0.7142857 0.6250000 0.875 0.750 0.8750000 0.875 0.750
2 0.7142857 0.5714286 0.4285714 0.3750000 0.500 0.5000000 0.6250000 0.500 0.500 0.6250000 1.000 1.000
3 0.5714286 0.4285714 0.4285714 0.3750000 0.375 0.7500000 1.0000000 1.000 1.000 1.0000000 1.000 1.000
4 0.7500000 0.1428571 0.5000000 0.5000000 0.125 0.5000000 0.5000000 0.625 0.375 0.5714286 0.500 0.750
5 0.5000000 0.4285714 0.5000000 0.6666667 0.375 0.3750000 0.7142857 0.750 0.500 0.8000000 0.750 0.625

13            14
1 1.0000000 0.750
2 1.0000000 1.000
3 1.0000000 1.000
4 0.7500000 0.375
5 0.7142857 0.875


The constraints are typical values ranging between 1 and 13 (inflection), with a mean around 7.

I am trying to find the best fit for my data; later on I want to see which subjects are not in the 95 % credible interval and analyze them further (some participants changed their response behavior during the task, so I am basically trying to find those switchers).

I hope everything is a bit clearer now.

Carina

Hey Carina!

Could you please upload your data as a .csv or something similar via the Upload button?

I'm afraid it's not really clear to me yet. Is this a psychology or behavioral economics study?

Do you want to model people's behavior over time (over blocks)?

Usually, logistic regression refers to a regression of Bernoulli- or Binomial-distributed dependent data (integer, not continuous), where the probability is parametrized by a logit (or logistic) function. Is this what you had in mind? I guess you rather want to model the probability (or in this case proportion) as a logistic function over time (blocks), right?

Allow me to take a step back for a second. So, it seems (correct me if I'm wrong) that you have 114 individuals, who did 96 (trials) times 14 (blocks) = 1344 tasks (in total) each. Is that correct? It looks like the performance figures are multiples of 1/14. Is it 14 trials per block times 14 blocks, equal to 96 tasks/trials per individual? Not sure I'm getting this right…

haha, sorry for all the questions. I think all this will become clearer to me when I have a look at the data and can try your model ;)


It's a psychology study (decision making), and yes, I want to model behavior over time (blocks), so I want to model the proportion as a logistic function over time.
Again you are right: 114 subjects with performance levels for each block. Performance is the number of correct trials per block / number of trials per block.
I am sorry for not being more precise ;-).
I uploaded the data as csv. I hope that makes your life easier for now.
Thank you again.
Carina
toydata.csv (3.7 KB)


What's the number of trials per block? Is it constant across all blocks?

Yes, it's the same number of trials.

And how many trials are there per block?

I'm asking because I want to reconstruct the number of correct trials per block per individual.

That would be performance * number of trials = number of correct trials, right? And the number of correct trials should be an integer.

I ask because I think it'd be better to model this as

\text{number correct trials}_t \sim \text{Binomial}(\text{number total trials}, \textit{performance}_t)

and then specify performance as the logistic function (over time/blocks), as you have stated above.
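A tiny numerical illustration of why the Binomial formulation is natural here (Python, with hypothetical numbers): with 6 correct out of 8 trials in a block, the binomial log-likelihood is highest when the performance parameter equals the observed proportion 6/8.

```python
from math import comb, log

def binom_logpmf(k, n, p):
    # log of the Binomial(n, p) probability mass at k successes
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# 6 correct out of 8 trials, evaluated at a few candidate performance values
lls = {p: binom_logpmf(6, 8, p) for p in (0.50, 0.75, 0.90)}
```

The performance 0.75 (= 6/8) gets the highest log-likelihood of the three candidates, which is exactly the information the Binomial sampling statement feeds back to the logistic curve for performance.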

Thank you so much for your help, Max. I see what you mean.
There are actually only 8 trials per block; I only analyze one coherence level in that specific dataset.

So it is 14 blocks consisting of 8 trials each.

I just remembered that the reason I used the performance per block is that I have some NA trials, meaning that sometimes there are fewer trials per block, and thus the effective trial number per block is not the same. For example, subject 1 has 4 correct in the first block and was too slow in one trial, so it's only 4 out of 7 correct (thinking about it now makes me wonder if that was a smart idea in the first place).
I uploaded the raw data of correct responses per block (again, it's 8 trials per block). I uploaded it for the first experiment, where we had 29 subjects.

Thank you again.

Carina

toydata.csv (1.02 KB)


Hi, @stemangiola and I solved a similar problem; you may find it useful to check out our reparametrization of the sigmoid function: Difficulties with logistic population growth model
I didn't check the math very carefully, but it seems to me that the only (but possibly important) difference is that we have a parameter for the "value of the sigmoid at the midpoint of the data" instead of the "maximal value of the sigmoid". (The paper we speak about in the linked post is still in the publication purgatory; I'll ask @stemangiola whether we could make a preprint available.)

Also note that you can use triple backticks to format chunks of code/program output/… I've edited the first post for you; you might want to edit the rest of the thread :-)

Hope that helps!


I've done it for one of the posts in this thread.


Okay. Let's start simple and work our way up.

The one huge simplification we are making is that we assume everyone had 8 trials. I don't know if that is the correct assumption. You said that some people were slow and couldn't complete their 8th trial… I'd say assuming 8 possible trials is probably fair enough (could be wrong), because being able to move through trials quickly might be a part of "performance". (I could be completely wrong here.) In any case, you could actually correct the code for different trial numbers.

Anyway, this is how I load the data.

library(tidyverse)

# toydata is the uploaded csv, read in beforehand (e.g. toydata <- read_csv("toydata.csv"))
names(toydata) <- c("subject_id", paste0("block_", names(toydata)[2:15]))
toydata <- toydata %>% mutate(subject_id = 1:n())
data_mat <- toydata %>% select(starts_with("block")) %>% as.matrix()

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

data_list <- list(
N = nrow(data_mat),
T = ncol(data_mat),
y = data_mat
)



Starting simple, here is a model that estimates average performance over subjects and over time. It's not the most efficient code, but the data set is pretty small…

mod0.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T]; // this could be fed in as data

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance;
}

model{
  for (n in 1:N)
    y[n] ~ binomial(trials[n], performance); // vectorize over T
}



The result is kind of the grand mean… not particularly interesting…

> mod0 <- stan(file = "model/mod0.stan", data = data_list)
> mod0
Inference for Stan model: mod0.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
performance     0.55    0.00 0.01     0.53     0.54     0.55     0.55     0.56  1691    1
lp__        -2239.53    0.02 0.70 -2241.47 -2239.70 -2239.28 -2239.10 -2239.05  1682    1


Now, let's estimate performance for each subject separately.

mod1.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T]; // could be data

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance[N];
}

model{
  for (n in 1:N)
    y[n] ~ binomial(trials[n], performance[n]); // again, vectorize over T
}



Let's plot the results:

Tiles are subjects (I numbered them differently to account for the one missing; see R code above), points are (sort of) the performance data points that you have in your data set (but assuming 8 trials for every block), and the blue lines/areas are the posterior probabilities of the time/block-constant, subject-varying performance means.

Now, let's vary performance by all subject-block combinations.

mod2.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance[N, T];
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);
}



This kind of replicates the data, but there is no real structure. Note that if the number of trials varied, you'd see slightly greater variance for those observations with fewer trials, as they "contain less information".

Okay… Now, to your model. Actually, I had difficulties fitting the function that you proposed… I set the minval to 0, and that worked out fine…

mod3.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  // all parameters vary by subject
  real<lower=0, upper=1> maxval[N];
  real<lower=0> steepness[N];
  real<lower=0> inflection[N];
}

transformed parameters{
  real<lower=0, upper=1> performance[N, T];

  // specify performance as a logistic function
  for (n in 1:N)
    for (t in 1:T)
      performance[n, t] = maxval[n] * (1 / (1 + exp(-steepness[n] * (t - inflection[n]))));
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);

  // these are some rough priors...
  // I got those from your suggestions and eyeballing...
  maxval ~ beta(9, 1);
  steepness ~ exponential(7); // this is on a much tighter scale than your guess in the first post
  inflection ~ gamma(7, 1);
}



This is the result as a plot:

From here you can further develop the model: incorporating the different trial numbers, reformulating the logistic function (minval?), setting more reasonable priors, or making the model hierarchical…

Hope this gives you a good starting point.

Cheers!
Max

PS: Tell me if you need the code for the plots.

EDIT:
I also quickly did the model with an (estimated) min value:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> minval[N];
  real<lower=0, upper=1> maxval[N];
  real<lower=0> steepness[N];
  real<lower=0> inflection[N];
}

transformed parameters{
  real<lower=0, upper=1> performance[N, T];

  for (n in 1:N)
    for (t in 1:T)
      performance[n, t] = (maxval[n] - minval[n]) * (1 / (1 + exp(-steepness[n] * (t - inflection[n])))) + minval[n];
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);

  minval ~ beta(1, 9);
  maxval ~ beta(9, 1);
  steepness ~ exponential(7);
  inflection ~ gamma(7, 1);
}


PPS: Looks like subject_id 21 to 29 are just replicates of subject_id 2 to 10…


Thank you so much, Max, it is way clearer to me now how to define my model using Stan syntax. I will explore a few more ideas; especially the hierarchical model might be interesting.
I tried to reproduce your (very pretty) plots, but I am not sure how to access the Stan fit data. It would be great if you could share your code.

Thank you again.

Carina

Sure. Here's the project folder. Unzip it and put it somewhere. I'd suggest double-clicking the psych-logistic.Rproj file and then opening script.R. Everything from loading data and running the models to plotting the results is in the script.R file.

Hello Stan community, I would like to compare different models using the loo package. I tried to generate data and get the log likelihood for my model. I tried something like this:

generated quantities {
  // likelihood
  int log_lik[N, T];
  for (n in 1:N)
    for (t in 1:T)
      log_lik[n, t] = binomial_lpmf(y[n, t] | trials[n, t], performance[n, t]);
}


But I am not sure if I am doing the correct thing. I also wanted to generate some random data to see how well my model predicts data.

Any help is very much appreciated.
Thank you so much.
Carina

Hello, Carina.

You should declare log_lik as real, I think, as the output of the binomial lpmf is real-valued. But, the way you are doing it now, you are evaluating the log-likelihood of each observed trial conditional on the prediction of your model. If you feed that to loo, I think you'll get leave-one-trial-out. I'm guessing you want to evaluate how well your model will predict the performance of a new participant across trials, not the result of a single new trial from one of the participants in the study?

What I think you should be doing (I trust someone will correct me if I'm giving you bad advice now) is to use log_sum_exp to sum the log-likelihoods of each participant's trials, and then use that in loo. I would make a vector called log_lik, and then do the rest within the loop with a different name. Something like this:

generated quantities {
  real log_lik[N];
  int trials_pred[N, T];

  for (n in 1:N) {
    real trials_log_lik[T];
    for (t in 1:T) {
      trials_log_lik[t] = binomial_lpmf(y[n, t] | trials[n, t], performance[n, t]);
      trials_pred[n, t] = binomial_rng(trials[n, t], performance[n, t]);
    }
    log_lik[n] = sum(trials_log_lik);
  }
}


I think that would give you leave-one-participant-out (if that is what you want).

Edit: Added a bit of code to generate draws from the posterior, as per your final question. I think it could be vectorised, but perhaps it's not a big deal, as there is already a loop.

Second edit: Removed the error pointed out by @martinmodrak, so as not to leave bad code to confuse others. Note that his implementation is more efficient.


Thank you. My loo output gives me the following values:

Computed from 8000 by 114 log-likelihood matrix

Estimate  SE
elpd_loo    133.3 3.5
p_loo         0.9 0.1
looic      -266.7 7.0



I guess the looic behaves like the AIC, and thus is not in itself informative? My PPC looks pretty weird:

I plotted 2000 draws and the fit for the first subject. My model is not really capturing what is going on.

Thanks for being helpful. The overall idea is sensible, but the execution is unfortunately incorrect: to get leave-one-participant-out cross-validation, you sum the individual log-likelihoods directly. This actually leads to more compact code, as you can then have (ignoring generating predictions):

generated quantities {
  real log_lik[N];

  for (n in 1:N) {
    log_lik[n] = binomial_lpmf(y[n] | trials[n], performance[n]);
  }
}
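For anyone wanting to sanity-check the summing logic outside Stan, here is a small Python sketch with made-up numbers: summing the per-block binomial log-likelihoods for one participant is the same as taking the log of the product of the per-block probabilities, i.e. the participant's joint likelihood that the vectorized binomial_lpmf statement computes.

```python
from math import comb, log, exp

def binom_logpmf(k, n, p):
    # log of the Binomial(n, p) probability mass at k successes
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# hypothetical correct counts and fitted performances for one participant's blocks
y_n    = [4, 5, 6, 7, 8, 6, 7]
perf_n = [0.50, 0.60, 0.70, 0.80, 0.90, 0.75, 0.85]

# leave-one-participant-out contribution: a plain sum of the per-block terms
log_lik_n = sum(binom_logpmf(k, 8, p) for k, p in zip(y_n, perf_n))
```

Exponentiating this sum recovers the product of the individual block probabilities, confirming that the plain sum (not log_sum_exp) gives the joint log-likelihood of all of that participant's blocks.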


Some more discussion at:

I think (not 100% sure) that elpd_loo can be a somewhat meaningful quantity (it is the "expected log predictive density"), but generally I find interpretation without another model to compare to challenging. PPCs are in my experience much more useful for assessing the fit of a single model.

Hope that helps!
