Getting max treedepth iterations for parallelized but not standard model

I’m trying to learn parallelization and I’m using a small simulated dataset for model development right now. For the first set of parameter values that I chose as representative of my actual problem, I found that the parallelized version of the model is much slower than the standard version, because it reaches max treedepth more often. Changing the effect size in the simulated problem then made parallelization the faster option.

I’m wondering two things: why I’m getting more max treedepth iterations in the parallelized model, and whether there are any rules of thumb for when parallelization will be useful for a problem of this type (i.e., what characteristics of the dataset lead to larger gains from parallelizing).

The model is a hierarchical regression model. The levels are different neurons: we measure each neuron’s response under two conditions, Stimulation = 0 and Stimulation = 1, and repeat the experiment for a number of trials. My dataset looks like this:

Neuron     Trial    Stim    Response
     0       1        0     15.627608
     0       2        0      8.079750
   ...      ...      ...        ...      
   199      49        1     13.033880
   199      50        1     12.602682

The parameters used to generate the dataset are the following (see the code at the end); the generative model is written out after the list:

  • nNeurons = 200 (number of levels)
  • nTrials = 50 (number of repeated measures)
  • meanBaseline = 10 (mean intercept)
  • sigmaBaseline = 3 (sd intercept between neurons)
  • meanEffect = 0.5 (mean effect of stimulation on the response)
  • sigmaEffect = 0.1 (sd of effect between neurons)
  • sigmaResidual = 3 (sd of the response, across trials of same neuron and condition)
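
Written out, the generative model that these parameters describe is (indexing neurons by i and trials by t):

  neuronBaseline[i] ~ Normal(meanBaseline, sigmaBaseline)
  neuronEffect[i]   ~ Normal(meanEffect, sigmaEffect)
  Response[i, t]    ~ Normal(neuronBaseline[i] + neuronEffect[i] * Stim[i, t], sigmaResidual)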

I fit a hierarchical model to these data, with varying slopes and intercepts. I parallelized the model following the map_rect section of the Stan User’s Guide, and I am testing the parallelization with a single chain. However, with the parameters above the standard version completed sampling in 80 s, while the parallelized version took 500 s with 4 threads (all 4 threads are in use, according to the machine’s CPU monitor). Essentially, while the standard version had only a few max treedepth iterations, the parallelized version hit max treedepth on every iteration.

When I change the parameters of the dataset to meanEffect = 4, sigmaResidual = 1, sigmaEffect = 0.5, giving a higher signal-to-noise ratio, the standard and parallelized models take 100 s and 40 s respectively, and neither produces any max treedepth iterations.

So, my questions are: why do I get so many max treedepth iterations with my parallelized version of the model? And for a problem like the one described above, when can I expect to benefit more from parallelization (e.g. with more levels? with more trials?)?

This is the standard model version:

data {
  int<lower=1> nNeurons;  // Number of groups
  int<lower=1> N; // Number of total datapoints
  vector[N] Stimulation; // Stimulation indicator
  vector[N] Response; // Response data
  array[N] int<lower=1> neuronId; // Neuron indices for each observation
}

parameters {
  real meanBaseline;      // Average baseline FR
  real meanEffect;      // Increment of FR with stimulation
  vector[nNeurons] neuronBaseline;  // Neuron-specific baseline FR
  vector[nNeurons] neuronEffect;  // Neuron-specific effect of stimulation
  real<lower=0> sigmaBaseline;     // Standard deviation of baseline FR
  real<lower=0> sigmaEffect;     // Standard deviation of stimulation effect
  real<lower=0> sigmaResidual;  // Standard deviation of residuals
}

model {
  // Priors
  meanBaseline ~ normal(0, 30);
  sigmaBaseline ~ normal(0, 5);
  neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
  meanEffect ~ normal(0, 30);
  sigmaEffect ~ normal(0, 5);
  neuronEffect ~ normal(meanEffect, sigmaEffect);
  sigmaResidual ~ normal(0, 5);
  vector[N] mu = neuronBaseline[neuronId] + Stimulation .* neuronEffect[neuronId];
  Response ~ normal(mu, sigmaResidual);
}

This is the parallelized model version:

functions {
  vector partial_sum(vector popParams, vector neuronParams, data array[] real xr, data array[] int xi) {
    // Unpack population-level parameters
    real meanBaseline = popParams[1];
    real meanEffect = popParams[2];
    real sigmaBaseline = popParams[3];
    real sigmaEffect = popParams[4];
    real sigmaResidual = popParams[5];
    // Unpack neuron-level parameters
    real neuronBaseline = neuronParams[1];
    real neuronEffect = neuronParams[2];
    // Compute log-likelihood of neuron params
    real lp1 = normal_lpdf(neuronBaseline | meanBaseline, sigmaBaseline);
    real lp2 = normal_lpdf(neuronEffect | meanEffect, sigmaEffect);
    // Compute log-likelihood of data
    real ll = normal_lpdf(xr | neuronBaseline + neuronEffect * to_vector(xi), sigmaResidual);
    return [lp1 + lp2 + ll]';
  }
}

data {
  int<lower=1> nNeurons;  // Number of groups
  int<lower=1> N; // Number of total datapoints
  array[N] int<lower=0, upper=1> Stimulation; // Stimulation indicator
  array[N] real Response; // Response data
  array[N] int<lower=1> neuronId; // Neuron indices for each observation
}

transformed data {
  int<lower=0> J = N / nNeurons;
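  // NOTE: this sharding assumes the data rows are ordered by neuron and that every
  // neuron contributes the same number of rows (J = N / nNeurons), which holds for
  // the simulated dataset generated by the Python script below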
  array[nNeurons, J] real xr;
  array[nNeurons, J] int<lower=0, upper=1> xi;
  {
    int pos = 1;
    for (i in 1:nNeurons) {
        int end = pos + J - 1;
        xr[i] = to_array_1d(Response[pos:end]);
        xi[i] = Stimulation[pos:end];
        pos = pos + J;
    }
  }
}

parameters {
  real meanBaseline;      // Average baseline FR
  real meanEffect;      // Increment of FR with stimulation
  vector[nNeurons] neuronBaseline;  // Neuron-specific baseline FR
  vector[nNeurons] neuronEffect;  // Neuron-specific effect of stimulation
  real<lower=0> sigmaBaseline;     // Standard deviation of baseline FR
  real<lower=0> sigmaEffect;     // Standard deviation of stimulation effect
  real<lower=0> sigmaResidual;  // Standard deviation of residuals
}

transformed parameters {
  vector[5] popParams;
  array[nNeurons] vector[2] neuronParams;
  for (i in 1:nNeurons) {
    neuronParams[i] = [neuronBaseline[i], neuronEffect[i]]';
  }
  popParams = [meanBaseline, meanEffect, sigmaBaseline, sigmaEffect, sigmaResidual]';
}

model {
  // Priors
  meanBaseline ~ normal(0, 30);
  sigmaBaseline ~ normal(0, 5);
  neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
  meanEffect ~ normal(0, 30);
  sigmaEffect ~ normal(0, 5);
  neuronEffect ~ normal(meanEffect, sigmaEffect);
  sigmaResidual ~ normal(0, 5);
  // Likelihood
  target += sum(map_rect(partial_sum, popParams, neuronParams, xr, xi));
}

And the Python code to generate the dataset and fit the models:

import numpy as np
import pandas as pd
from cmdstanpy import CmdStanModel
import os
import time

###############
# 1) GENERATE THE DATA
###############

# Dataset parameters
nTrials = 50  # number of trials per condition
nNeurons = 200  # number of neurons connected to the target
meanBaseline = 10  # mean baseline firing rate
sigmaBaseline = 3  # variability in baseline firing rate
meanEffect = 0.5
sigmaEffect = 0.1
sigmaResidual = 3

neuronBaselines = np.random.normal(0, sigmaBaseline, nNeurons) + \
  meanBaseline
neuronEffects = np.random.normal(0, sigmaEffect, nNeurons) + \
  meanEffect

# Generate the dataset
data = []
for neuron in range(nNeurons):
    for stimulation in [0, 1]:
        samples = np.random.normal(neuronBaselines[neuron] +
                                   neuronEffects[neuron] * stimulation,
                                   sigmaResidual, nTrials)
        data.extend(zip([neuron] * nTrials,
                        range(1, nTrials + 1),
                        [stimulation] * nTrials, samples))

# Create a DataFrame from the generated data
expData = pd.DataFrame(data, columns=['Neuron', 'Trial',
                                      'Stim', 'Response'])


###############
# 2) FIT MODELS
###############

nChains = 1

# Prepare the data
stan_data = {
    'nNeurons': nNeurons,
    'N': nNeurons * nTrials * 2,
    'Stimulation': expData['Stim'].values,
    'Response': expData['Response'].values,
    'neuronId': expData['Neuron'].values + 1,
}

### FIT STANDARD MODEL
stan_file = './stan_models/standard.stan'
model = CmdStanModel(stan_file=stan_file)
start = time.time()
posterior = model.sample(data=stan_data, chains=nChains)
end = time.time()
print(f'Elapsed (standard): {end - start}')

### FIT PARALLELIZED MODEL
threads = 4
stan_file = './stan_models/parallel.stan'
model = CmdStanModel(stan_file=stan_file,
                     cpp_options={'STAN_THREADS': True})
os.environ['STAN_NUM_THREADS'] = str(threads)
start = time.time()
posterior = model.sample(data=stan_data, chains=nChains,
                         threads_per_chain=threads, show_progress=True)
end = time.time()
print(f'Elapsed (parallel): {end - start}')


Parallelization is just a different way of computing the same log-likelihood, so by itself it should not affect treedepth.
One difference I do see is that your parallelized model applies the hierarchical priors for neuronBaseline and neuronEffect twice: once in the model block and then again inside partial_sum.
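
Since partial_sum already adds normal_lpdf(neuronBaseline | ...) and normal_lpdf(neuronEffect | ...) for every neuron, one way to fix it is to drop those two sampling statements from the model block and keep only the population-level priors there. Roughly (just a sketch of the model block of the parallelized version):

model {
  // Population-level priors only; the neuron-level terms are added inside partial_sum
  meanBaseline ~ normal(0, 30);
  sigmaBaseline ~ normal(0, 5);
  meanEffect ~ normal(0, 30);
  sigmaEffect ~ normal(0, 5);
  sigmaResidual ~ normal(0, 5);
  // Likelihood (plus neuron-level priors) via map_rect
  target += sum(map_rect(partial_sum, popParams, neuronParams, xr, xi));
}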

By the way, I’d rather use reduce_sum() than map_rect() in this model.


Thanks!! I had missed that (it’s my first time using map_rect). Now the parallelized version no longer hits max treedepth and is actually much faster than the standard model (20 s vs the standard’s 80 s).

If you have time, would you mind sharing why you’d prefer reduce_sum over map_rect for this model? I also implemented a reduce_sum version, but it seemed to me that map_rect would be better because it splits the sum into chunks that each correspond to one level of the hierarchy. reduce_sum seems much easier to implement, though, so I’d rather use it when possible.

Below is my implementation using reduce_sum, which is also slower than my standard model (and now much slower than the map_rect version). When I set grainsize = 1 it barely moves, so I tried grainsize = 5000, 2000 and 10000, and in all cases it is slower than the standard model (total N is 20000 and I’m using 4 threads).

functions {
  real partial_sum(array[] real Response_slice, int start, int end, vector mu, 
                   real sigmaResidual) {
    return normal_lpdf(Response_slice | mu[start:end], sigmaResidual);
  }
}

data {
  int<lower=1> nNeurons;  // Number of groups
  int<lower=1> N; // Number of total datapoints
  vector[N] Stimulation; // Stimulation indicator
  array[N] real Response; // Response data
  array[N] int<lower=1> neuronId; // Neuron indices for each observation
}

parameters {
  real meanBaseline;      // Average baseline FR
  real meanStimEffect;      // Increment of FR with stimulation
  vector[nNeurons] neuronBaseline;  // Neuron-specific baseline FR
  vector[nNeurons] neuronEffect;  // Neuron-specific effect of stimulation
  real<lower=0> sigmaBaseline;     // Standard deviation of baseline FR
  real<lower=0> sigmaEffect;     // Standard deviation of stimulation effect
  real<lower=0> sigmaResidual;  // Standard deviation of residuals
}

model {
  // Priors
  meanBaseline ~ normal(0, 30);
  sigmaBaseline ~ normal(0, 5);
  neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
  meanStimEffect ~ normal(0, 30);
  sigmaEffect ~ normal(0, 5);
  neuronEffect ~ normal(meanStimEffect, sigmaEffect);
  sigmaResidual ~ normal(0, 5);
  vector[N] mu = neuronBaseline[neuronId] + Stimulation .* neuronEffect[neuronId];
  int grainsize = 5000; // Target number of observations per partial sum
  target += reduce_sum(partial_sum, Response, grainsize, mu, sigmaResidual);
}

In case it’s of use to anyone: following these other posts 1 2, I modified the reduce_sum code as follows to compute the linear predictor inside the partial-sum function, and the model is now on par with map_rect, with much less coding complexity:

functions {
  real partial_sum(array[] real Response_slice, int start, int end, vector Stimulation,
                   array[] int neuronId, vector neuronBaseline, vector neuronEffect,
                   real sigmaResidual) {
      vector[end-start+1] mu = neuronBaseline[neuronId[start:end]] +
        Stimulation[start:end] .* neuronEffect[neuronId[start:end]];
    return normal_lpdf(Response_slice | mu, sigmaResidual);
  }
}

data {
  int<lower=1> nNeurons;  // Number of groups
  int<lower=1> N; // Number of total datapoints
  vector[N] Stimulation; // Stimulation indicator
  array[N] real Response; // Response data
  array[N] int<lower=1> neuronId; // Neuron indices for each observation
  int grainsize; // Target number of observations per partial sum
}

parameters {
  real meanBaseline;      // Average baseline FR
  real meanStimEffect;      // Increment of FR with stimulation
  vector[nNeurons] neuronBaseline;  // Neuron-specific baseline FR
  vector[nNeurons] neuronEffect;  // Neuron-specific effect of stimulation
  real<lower=0> sigmaBaseline;     // Standard deviation of baseline FR
  real<lower=0> sigmaEffect;     // Standard deviation of stimulation effect
  real<lower=0> sigmaResidual;  // Standard deviation of residuals
}

model {
  // Priors
  meanBaseline ~ normal(0, 30);
  sigmaBaseline ~ normal(0, 5);
  neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
  meanStimEffect ~ normal(0, 30);
  sigmaEffect ~ normal(0, 5);
  neuronEffect ~ normal(meanStimEffect, sigmaEffect);
  sigmaResidual ~ normal(0, 5);
  target += reduce_sum(partial_sum, Response, grainsize, Stimulation, neuronId,
    neuronBaseline, neuronEffect, sigmaResidual);
}
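
Note that grainsize is now read from the data block, so it also has to be supplied from Python; a minimal sketch, reusing the stan_data dict and sample call from the script above (the value 2000 is just a starting point to tune):

stan_data['grainsize'] = 2000  # target observations per partial-sum chunk
posterior = model.sample(data=stan_data, chains=1,
                         threads_per_chain=threads, show_progress=True)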


I think that in this case folding the expression for mu directly into the lpdf call may improve things further:

functions {
  real partial_sum(array[] real Response_slice, int start, int end, vector Stimulation,
                   array[] int neuronId, vector neuronBaseline, vector neuronEffect,
                   real sigmaResidual) {
    //return normal_lpdf(Response_slice | Stimulation[start:end] .* neuronEffect[neuronId[start:end]], sigmaResidual);
    // also prefer the "lupdf" if you do not need the normalisation
    return normal_lupdf(Response_slice | Stimulation[start:end] .* neuronEffect[neuronId[start:end]], sigmaResidual);
  }
}

This way Eigen expression templates ("magic" Eigen expressions) can be used, avoiding a few copies of the processed variables.

The reason this is better is that the sub-expressions are then formed within the partial-sum context. Each partial sum is evaluated on its own AD tape, so instead of one huge AD tape you get many small ones. The small ones fit better in the CPU cache, which leads to better memory locality. At least this is my understanding… this was a nice by-product of reduce_sum.


Just a quick comment about _lupdfs because they’re quite finicky.
The form you’ve posted does not compile because the surrounding function name must end in _lpdf for unnormalized functions to be available:

Semantic error in 'example.stan', line 7, column 7 to column 112:
   -------------------------------------------------
     5:      //return normal_lpdf(Response_slice | Stimulation[start:end] .* neuronEffect[neuronId[start:end]], sigmaResidual);
     6:    // also prefer the "lupdf" if you do not need the normalisation
     7:  return normal_lupdf(Response_slice | Stimulation[start:end] .* neuronEffect[neuronId[start:end]], sigmaResidual);
                ^
     8:    }
     9:  }
   -------------------------------------------------

Functions with names ending in _lupdf and _lupmf can only be used in the model block or user-defined functions with names ending in _lpdf or _lpmf.

Furthermore, for the _lupdf to actually have an effect you must use the _lupdf suffix in the reduce_sum call too; otherwise the compiler just generates a call with the normalization constants anyway.

functions {
  real partial_sum_lpdf(array[] real Response_slice, int start, int end, vector Stimulation,
//                 ^^^^ NOTE: suffix here
                   array[] int neuronId, vector neuronBaseline, vector neuronEffect,
                   real sigmaResidual) {
    return normal_lupdf(Response_slice | Stimulation[start:end] .* neuronEffect[neuronId[start:end]], sigmaResidual);
  }
}
...
model {
  ...
  target += reduce_sum(partial_sum_lupdf, Response, grainsize, Stimulation, neuronId,
//                                 ^^^^^ NOTE: suffix here
    neuronBaseline, neuronEffect, sigmaResidual);
}

_lupdf vs _lpdf does not make much difference for the normal distribution, but it can be significant for distributions like the gamma or binomial.
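
For a concrete sense of what gets dropped (a small sketch, not specific to this model): inside the model block the three lines below put the same parameter-dependent terms on target, but only the plain _lpmf call also pays for the constant log binomial coefficient.

y ~ binomial(n, theta);                  // drops constants, like _lupmf
target += binomial_lupmf(y | n, theta);  // unnormalized: no log-choose term
target += binomial_lpmf(y | n, theta);   // normalized: includes log choose(n, y)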
