# Trying to fit my customised logistic function

Hello Stan Community,

I have data from 114 subjects that I stored in an R matrix (rows are subjects, columns are blocks from 1 to 14). The dependent variable is the performance in each block, ranging from 0 to 1 (continuous). I want to fit the data to a logistic regression that has 3 parameters (steepness, max, inflection). Eventually I want to get an estimate for each parameter for each subject. Some of the subjects are in a control condition that I use as a baseline, so I can define outliers from that function by taking the 95 % credible interval for each parameter from the controls (basically I want to see which subjects don't follow the control function).
I managed to do that in R using nloptr. Now I want to implement my code in Stan.
I am a total beginner with Stan, so it would be awesome if I could get some help.

My data is stored in a matrix:

114 rows, 14 columns
Each row represents a subject and each column represents a block (1:14).
Each column shows the accuracy per block, which is a float ranging from the minimum of all blocks to 1 (depending on the subject, the minimum accuracy is >= 0.5).

I want to fit this function to the data:

Parameters:

inflection (ranging from 1-13, mean = 7)
steepness (ranging from 1-29, mean = 7)
max (ranging from 0-1, mean = 0.9)

(min does not need to be computed, it just depends on the subject)
I want to fit this function:
y = (maxval[n] - min[n]) * (1 / (1 + exp(-steepness[n] * (x[n] - inflection[n])))) + min[n]

Finally I want to get estimates per subject for each parameter.
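To get a feel for the curve before coding it in Stan, here is a quick numerical sketch (Python, purely illustrative; the parameter values are made up): the function rises from min to max and crosses the midpoint exactly at the inflection block.

```python
import math

def logistic_perf(x, minval, maxval, steepness, inflection):
    # three-parameter logistic on top of a fixed floor:
    # rises from minval to maxval, hitting (minval + maxval) / 2 at x = inflection
    return (maxval - minval) / (1.0 + math.exp(-steepness * (x - inflection))) + minval

# hypothetical values: chance floor 0.5, ceiling 0.9, inflection at block 7
curve = [logistic_perf(x, 0.5, 0.9, 2.0, 7.0) for x in range(1, 15)]
```

With these values the curve sits near 0.5 for the first blocks, passes 0.7 exactly at block 7, and flattens out near 0.9, which is the qualitative shape described above.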

I tried the following code:

logSTAN.stan = "data {
  vector x[n];
  vector y[n];
  vector min[n];
}
parameters {
  int inflection;
  int steepness;
  real maxval;
}
model {
  inflection ~ normal(7, 13);
  steepness ~ normal(7, 29);
  max ~ normal(0.9, 1);
  y ~ ((maxval[n]-min[n])*(1/(1+exp(-steepness[n]*(x[n]-inflection[n])))) + min[n]);
}"


I am not sure how to get my data, stored in a matrix, into the proper Stan format to loop over all subjects.

I tried this for 1 subject only:

logModel.stan = "data {
  int x;
  real y;
}
parameters {
  real inflection;
  real steepness;
  real maxval;
}
model {
  inflection ~ normal(7, 13);
  steepness ~ normal(7, 29);
  maxval ~ normal(0.9, 1);
  y ~ (maxval-0.5)*(1/(1+exp(-steepness*(x-inflection)))) + 0.5;
}
"


However, it does not work because I don't define a distribution. My dependent variable ranges from 0 to 1 and is a float; I tried a normal distribution, but that does not work, as Stan does not allow my function as an input. Any ideas?

Thank you so much.

Carina


Hello everyone, I fitted my model for one subject:

sub1data <- list(t = length(sub1), y = sub1)
logModel.stan = "functions {
  real logfun_log(vector t, real maxval, real steepness, real inflection) {
    vector[num_elements(t)] prob;
    real lprob;
    for (i in 1:num_elements(t)) {
      prob[i] <- (maxval-0.5)*(1/(1+exp(-steepness*(t[i]-inflection)))) + 0.5;
    }
    lprob <- sum(log(prob));
    return lprob;
  }
}
data {
  int t;
  vector[t] y;
}
parameters {
  real<lower=7, upper=13> inflection;
  real<lower=7, upper=29> steepness;
  real<lower=0.9, upper=1> maxval;
}
model {
  y ~ logfun(maxval, steepness, inflection);
}
"


It looks pretty good. I am still unsure about my priors and about specifying a distribution for my parameters; any ideas would be very much appreciated. I am currently trying to fit all subjects into one model, and my ideal output would be a matrix where I can see all three parameters for each subject.

Thank you again.

Carina

Hi Carina! Welcome to the Stan forums! :)

Are you able to share some example (or toy) data with us? It'd be easier to have a look into your questions with some data at hand.

Could you give a bit more context about what you are doing? What is performance? What are blocks? (Maybe in 2-3 sentences)

inflection (ranging from 1-13, mean = 7)
steepness (ranging from 1-29, mean = 7)
max (ranging from 0-1, mean = 0.9)

Are these constraints ("ranging from …") binding? Or is it that, for example, inflection is non-zero, with typical values between 1-13 and a mean at around 7?

Also, in your Stan code, what's the data? You switched from vector x[n]; vector y[n]; vector min[n]; to int x; real y; to int t; vector[t] y;. It's a bit hard to tell what's going on here without context.

I would also encourage you to write your Stan program in a separate *.stan file. Using RStudio with Stan files gives you syntax highlighting and a Check button, to (roughly) see whether the code is fine.

Cheers!
Max

Thank you, I will try that using a .stan file.
I have a matrix with rows for subjects (114) and columns for blocks (14 blocks). Performance per block is what I am interested in, and it therefore ranges between 0.5 and 1 (they can be either correct or incorrect on each trial; one block consists of 96 trials). I want to fit a function that is defined through steepness, maximum, minimum and an inflection point. Later on, I want to compare a linear vs. logistic vs. step function.
That's how I stored my data:

          1         2         3         4     5         6         7     8     9        10    11    12

1 0.5714286 0.5000000 0.8750000 0.5000000 0.500 0.7142857 0.6250000 0.875 0.750 0.8750000 0.875 0.750
2 0.7142857 0.5714286 0.4285714 0.3750000 0.500 0.5000000 0.6250000 0.500 0.500 0.6250000 1.000 1.000
3 0.5714286 0.4285714 0.4285714 0.3750000 0.375 0.7500000 1.0000000 1.000 1.000 1.0000000 1.000 1.000
4 0.7500000 0.1428571 0.5000000 0.5000000 0.125 0.5000000 0.5000000 0.625 0.375 0.5714286 0.500 0.750
5 0.5000000 0.4285714 0.5000000 0.6666667 0.375 0.3750000 0.7142857 0.750 0.500 0.8000000 0.750 0.625

13            14
1 1.0000000 0.750
2 1.0000000 1.000
3 1.0000000 1.000
4 0.7500000 0.375
5 0.7142857 0.875


The constraints are typical values ranging between 1 and 13 (inflection), with a mean around 7.

I am trying to find the best fit for my data; later on I want to see which subjects are not in the 95 % credible interval and analyze them further (some participants changed their response behavior during the task, so I am basically trying to find those switchers).

I hope everything is a bit clearer now.

Carina

Hey Carina!

Could you please upload your data as a .csv or something similar via the Upload button?

I'm afraid it's not really clear to me yet. Is this a psychology or behavioral economics study?

Do you want to model people's behavior over time (over blocks)?

Usually, logistic regression refers to a regression of Bernoulli- or Binomial-distributed dependent data (integer, not continuous), where the probability is parametrized by a logit (or logistic) function. Is this what you had in mind? I guess you rather want to model the probability (or in this case proportion) as a logistic function over time (blocks), right?

Allow me to take a step back for a second. So, it seems (correct me if I'm wrong) that you have 114 individuals, who did 96 (trials) times 14 (blocks) = 1344 tasks (in total) each. Is that correct? It looks like the performance figures are multiples of 1/14. Is it 14 trials per block times 14 blocks, equal to 96 tasks/trials per individual? Not sure I'm getting this right…

haha, sorry for all the questions. I think all this will become clearer to me when I have a look at the data and can try your model ;)


It's a psychology study (decision making), and yes, I want to model behavior over time (blocks), so I want to model the proportion as a logistic function over time.
Again you are right: 114 subjects with performance levels for each block. Performance is the number of correct trials per block / number of trials per block.
I am sorry for not being more precise ;-).
I uploaded the data as csv. I hope that makes your life easier for now.
Thank you again.
Carina
toydata.csv (3.7 KB)


What's the number of trials per block? Is it constant across all blocks?

Yes, it's the same number of trials.

And how many trials are there per block?

I'm asking because I want to reconstruct the number of correct trials per block per individual.

That would be performance * number of trials = number of correct trials, right? And the number of correct trials should be an integer.

I ask because I think it'd be better to model this as

\text{number correct trials}_t \sim \text{Binomial}(\text{number total trials}, \textit{performance}_t)

and then specify performance as the logistic function (over time/blocks), as you have stated above.
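A tiny numerical illustration of why the Binomial formulation is natural here (Python, with hypothetical numbers): with 6 correct out of 8 trials in a block, the binomial log-likelihood is highest when the performance parameter equals the observed proportion 6/8.

```python
from math import comb, log

def binom_logpmf(k, n, p):
    # log of the Binomial(n, p) probability mass at k successes
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# 6 correct out of 8 trials, evaluated at a few candidate performance values
lls = {p: binom_logpmf(6, 8, p) for p in (0.50, 0.75, 0.90)}
```

The performance 0.75 (= 6/8) gets the highest log-likelihood of the three candidates, which is exactly the information the Binomial sampling statement feeds back to the logistic curve for performance.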

Thank you so much for your help, Max. I see what you mean.
There are actually only 8 trials per block; I only analyze one coherence level in that specific dataset.

So it is 14 blocks consisting of 8 trials each.

I just remembered that the reason I used the performance per block is that I have some NA trials, meaning that sometimes there are fewer trials per block, and thus the effective trial number per block is not the same. For example, subject 1 has 4 correct in the first block and was too slow in one trial, so it's only 4 out of 7 correct (thinking about it now makes me wonder if that was a smart idea in the first place).
I uploaded the raw data of correct responses per block (again, it's 8 trials per block). I uploaded it for the first experiment, where we had 29 subjects.

Thank you again.

Carina

toydata.csv (1.02 KB)


Hi, @stemangiola and I solved a similar problem; you may find it useful to check out our reparametrization of the sigmoid function: Difficulties with logistic population growth model
I didn't check the math very carefully, but it seems to me that the only (but possibly important) difference is that we have a parameter for the "value of the sigmoid at the midpoint of the data" instead of the "maximal value of the sigmoid". (The paper we speak about in the linked post is still in the publication purgatory; I'll ask @stemangiola whether we could make a preprint available.)

Also note that you can use triple backticks to format chunks of code/program output/… I've edited the first post for you; you might want to edit the rest of the thread :-)

Hope that helps!


I've done it for one of the posts in this thread.


Okay. Let's start simple and work our way up.

The one huge simplification we are making is that we assume everyone had 8 trials. I don't know if that is the correct assumption. You said that some people were slow and couldn't complete their 8th trial… I'd say assuming 8 possible trials is probably fair enough (could be wrong), because being able to move through trials quickly might be a part of "performance". (I could be completely wrong here.) In any case, you could actually correct the code for different trial numbers.

Anyway, this is how I load the data.

library(tidyverse)

# toydata is the uploaded csv, read in beforehand (e.g. toydata <- read_csv("toydata.csv"))
names(toydata) <- c("subject_id", paste0("block_", names(toydata)[2:15]))
toydata <- toydata %>% mutate(subject_id = 1:n())
data_mat <- toydata %>% select(starts_with("block")) %>% as.matrix()

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

data_list <- list(
N = nrow(data_mat),
T = ncol(data_mat),
y = data_mat
)



Starting simple, here is a model that estimates average performance over subjects and over time. It's not the most efficient code, but the data set is pretty small…

mod0.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T]; // this could be fed in as data

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance;
}

model{
  for (n in 1:N)
    y[n] ~ binomial(trials[n], performance); // vectorize over T
}



The result is kind of the grand mean… not particularly interesting…

> mod0 <- stan(file = "model/mod0.stan", data = data_list)
> mod0
Inference for Stan model: mod0.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
performance     0.55    0.00 0.01     0.53     0.54     0.55     0.55     0.56  1691    1
lp__        -2239.53    0.02 0.70 -2241.47 -2239.70 -2239.28 -2239.10 -2239.05  1682    1


Now, let's estimate performance for each subject separately.

mod1.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T]; // could be data

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance[N];
}

model{
  for (n in 1:N)
    y[n] ~ binomial(trials[n], performance[n]); // again, vectorize over T
}



Let's plot the results:

Tiles are subjects (I numbered them differently to account for the one missing; see R code above), points are (sort of) the performance data points that you have in your data set (but assuming 8 trials for every block), and the blue lines/areas are the posterior probabilities of the time/block-constant, subject-varying performance means.

Now, let's vary performance by all subject-block combinations.

mod2.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> performance[N, T];
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);
}



This kind of replicates the data, but there is no real structure. Note that if the number of trials varied, you'd see slightly greater variance for those observations with fewer trials, as they "contain less information".

Okay… Now, to your model. Actually, I had difficulties fitting the function that you proposed… I set the minval to 0, and that worked out fine…

mod3.stan:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  // all parameters vary by subject
  real<lower=0, upper=1> maxval[N];
  real<lower=0> steepness[N];
  real<lower=0> inflection[N];
}

transformed parameters{
  real<lower=0, upper=1> performance[N, T];

  // specify performance as a logistic function
  for (n in 1:N)
    for (t in 1:T)
      performance[n, t] = maxval[n] * (1 / (1 + exp(-steepness[n] * (t - inflection[n]))));
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);

  // these are some rough priors...
  // I got those from your suggestions and eyeballing...
  maxval ~ beta(9, 1);
  steepness ~ exponential(7); // this is on a much tighter scale than your guess in the first post
  inflection ~ gamma(7, 1);
}



This is the result as a plot:

From here you can further develop the model: incorporating the different trial numbers, reformulating the logistic function (minval?), setting more reasonable priors, or making the model hierarchical…

Hope this gives you a good starting point.

Cheers!
Max

PS: Tell me if you need the code for the plots.

EDIT:
I also quickly did the model with an (estimated) min value:

data{
  int<lower=1> N;
  int<lower=1> T;
  int<lower=0, upper=8> y[N, T];
}

transformed data{
  int<lower=1, upper=8> trials[N, T];

  for (t in 1:T)
    for (n in 1:N)
      trials[n, t] = 8;
}

parameters{
  real<lower=0, upper=1> minval[N];
  real<lower=0, upper=1> maxval[N];
  real<lower=0> steepness[N];
  real<lower=0> inflection[N];
}

transformed parameters{
  real<lower=0, upper=1> performance[N, T];

  for (n in 1:N)
    for (t in 1:T)
      performance[n, t] = (maxval[n] - minval[n]) * (1 / (1 + exp(-steepness[n] * (t - inflection[n])))) + minval[n];
}

model{
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ binomial(trials[n, t], performance[n, t]);

  minval ~ beta(1, 9);
  maxval ~ beta(9, 1);
  steepness ~ exponential(7);
  inflection ~ gamma(7, 1);
}


PPS: Looks like subject_id 21 to 29 are just replicates of subject_id 2 to 10…


Thank you so much, Max, it is way clearer to me now how to define my model using Stan syntax. I will explore a few more ideas; especially the hierarchical model might be interesting.
I tried to reproduce your (very pretty) plots, but I am not sure how to access the Stan fit data. It would be great if you could share your code.

Thank you again.

Carina

Sure. Here's the project folder. Unzip it and put it somewhere. I'd suggest double-clicking the psych-logistic.Rproj file and then opening script.R. Everything from loading data and running the models to plotting the results is in the script.R file.

Hello Stan community, I would like to compare different models using the loo package. I tried to generate data and get the log likelihood for my model. I tried something like this:

generated quantities {
  // likelihood
  int log_lik[N, T];
  for (n in 1:N)
    for (t in 1:T)
      log_lik[n, t] = binomial_lpmf(y[n, t] | trials[n, t], performance[n, t]);
}


But I am not sure if I am doing the correct thing. I also wanted to generate some random data to see how well my model predicts data.

Any help is very much appreciated.
Thank you so much.
Carina

Hello, Carina.

You should declare log_lik as real, I think, as the output of the binomial lpmf is real-valued. But, the way you are doing it now, you are evaluating the log-likelihood of each observed trial conditional on the prediction of your model. If you feed that to loo, I think you'll get leave-one-trial-out. I'm guessing you want to evaluate how well your model will predict the performance of a new participant across trials, not the result of a single new trial from one of the participants in the study?

What I think you should be doing (I trust someone will correct me if I'm giving you bad advice now) is to use log_sum_exp to sum the log-likelihoods of each participant's trials, and then use that in loo. I would make a vector called log_lik, and then do the rest within the loop with a different name. Something like this:

generated quantities {
  real log_lik[N];
  int trials_pred[N, T];

  for (n in 1:N) {
    real trials_log_lik[T];
    for (t in 1:T) {
      trials_log_lik[t] = binomial_lpmf(y[n, t] | trials[n, t], performance[n, t]);
      trials_pred[n, t] = binomial_rng(trials[n, t], performance[n, t]);
    }
    log_lik[n] = sum(trials_log_lik);
  }
}


I think that would give you leave-one-participant-out (if that is what you want).

Edit: Added a bit of code to generate draws from the posterior, as per your final question. I think it could be vectorised, but perhaps it's not a big deal, as there is already a loop.

Second edit: Removed the error pointed out by @martinmodrak, so as not to leave bad code to confuse others. Note that his implementation is more efficient.


Thank you. My loo output gives me the following values:

Computed from 8000 by 114 log-likelihood matrix

Estimate  SE
elpd_loo    133.3 3.5
p_loo         0.9 0.1
looic      -266.7 7.0



I guess the looic behaves like the AIC, and thus is not in itself informative? My PPC looks pretty weird:

I plotted 2000 draws and the fit for the first subject. My model is not really capturing what is going on.

Thanks for being helpful. The overall idea is sensible, but the execution is unfortunately incorrect: to get leave-one-participant-out cross-validation, you sum the individual log-likelihoods directly. This actually leads to more compact code, as you can then have (ignoring generating predictions):

generated quantities {
  real log_lik[N];

  for (n in 1:N) {
    log_lik[n] = binomial_lpmf(y[n] | trials[n], performance[n]);
  }
}
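For anyone wanting to sanity-check the summing logic outside Stan, here is a small Python sketch with made-up numbers: summing the per-block binomial log-likelihoods for one participant is the same as taking the log of the product of the per-block probabilities, i.e. the participant's joint likelihood that the vectorized binomial_lpmf statement computes.

```python
from math import comb, log, exp

def binom_logpmf(k, n, p):
    # log of the Binomial(n, p) probability mass at k successes
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# hypothetical correct counts and fitted performances for one participant's blocks
y_n    = [4, 5, 6, 7, 8, 6, 7]
perf_n = [0.50, 0.60, 0.70, 0.80, 0.90, 0.75, 0.85]

# leave-one-participant-out contribution: a plain sum of the per-block terms
log_lik_n = sum(binom_logpmf(k, 8, p) for k, p in zip(y_n, perf_n))
```

Exponentiating this sum recovers the product of the individual block probabilities, confirming that the plain sum (not log_sum_exp) gives the joint log-likelihood of all of that participant's blocks.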


Some more discussion at:

I think (not 100% sure) that elpd_loo can be a somewhat meaningful quantity (it is the "expected log predictive density"), but generally I find interpretation without another model to compare to challenging. PPCs are in my experience much more useful for assessing the fit of a single model.

Hope that helps!
