I’m trying to learn parallelization in Stan and am currently using a small simulated dataset for model development. For the first set of parameter values I chose as representative of my actual problem, the parallelized version of the model is much slower than the standard version, because it hits max treedepth far more often. However, changing the effect size in the simulated problem made parallelization the faster option.
I’m wondering two things: why I’m getting more max treedepth iterations in the parallelized model, and whether there are rules of thumb for when parallelization will be useful for a problem of this type (i.e., what characteristics of the dataset lead to larger gains from parallelizing).
The model is a hierarchical regression model. The levels are different neurons: we measure each neuron’s response under two conditions, Stimulation = 0 and Stimulation = 1, and repeat the experiment for a number of trials. My dataset looks like this:
Neuron  Trial  Stim  Response
0       1      0     15.627608
0       2      0     8.079750
...     ...    ...   ...
199     49     1     13.033880
199     50     1     12.602682
The parameters used to generate the dataset are listed below (see the code at the end), followed by a quick check of the dataset size they imply:
- nNeurons = 200 (number of levels)
- nTrials = 50 (number of repeated measures per condition)
- meanBaseline = 10 (mean intercept)
- sigmaBaseline = 3 (between-neuron sd of the intercept)
- meanEffect = 0.5 (mean effect of stimulation on the response)
- sigmaEffect = 0.1 (between-neuron sd of the effect)
- sigmaResidual = 3 (sd of the response across trials within the same neuron and condition)
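For scale, these settings imply the following dataset size and per-neuron chunk size (a quick Python sketch; J matches the quantity computed in the parallel model's transformed data block):

nNeurons = 200
nTrials = 50
N = nNeurons * 2 * nTrials  # total rows: 200 neurons x 2 conditions x 50 trials = 20,000
J = N // nNeurons           # observations per neuron, i.e. per map_rect shard: 100
print(N, J)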
I fit a hierarchical model with varying slopes and intercepts to these data. I parallelized the model following the map_rect example in the Stan User's Guide, and I am testing the parallelization with a single chain. However, with the parameters above the standard version finished sampling in 80 s, while the parallelized version took 500 s with 4 threads (all 4 threads are in use according to the system monitor). Essentially, the standard version had only a few iterations that hit max treedepth, whereas the parallelized version hit max treedepth on every iteration.
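(For reference, this is how I count the iterations that hit max treedepth; a minimal sketch using cmdstanpy's method_variables() on the posterior object returned by model.sample() in the code at the end, and assuming the default max_treedepth of 10.)

import numpy as np
# 'treedepth__' holds the sampler's treedepth for each draw and chain;
# draws at the cap (10 by default) are the ones reported as max treedepth.
treedepth = posterior.method_variables()['treedepth__']
print('iterations at max treedepth:', int((treedepth >= 10).sum()))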
When I change the dataset parameters to meanEffect = 4, sigmaResidual = 1, sigmaEffect = 0.5, giving a much higher signal-to-noise ratio, the standard and parallelized versions take 100 s and 40 s respectively, and neither one hits max treedepth.
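(To put a number on the signal/noise change, the mean effect in units of the residual sd goes from about 0.17 to 4; a trivial check:)

for meanEffect, sigmaResidual in [(0.5, 3), (4, 1)]:
    print(f'standardized effect: {meanEffect / sigmaResidual:.2f}')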
So, to restate the questions: why do I get so many max treedepth iterations with the parallelized version of the model? And for a problem like the one described above, when can I expect to benefit most from parallelization (e.g., with more levels? with more trials?)?
This is the standard model version:
data {
int<lower=1> nNeurons; // Number of groups
int<lower=1> N; // Number of total datapoints
vector[N] Stimulation; // Stimulation indicator
vector[N] Response; // Response data
array[N] int<lower=1> neuronId; // Neuron indices for each observation
}
parameters {
real meanBaseline; // Average baseline FR
real meanEffect; // Increment of FR with stimulation
vector[nNeurons] neuronBaseline; // Neuron-specific baseline FR
vector[nNeurons] neuronEffect; // Neuron-specific effect of stimulation
real<lower=0> sigmaBaseline; // Standard deviation of baseline FR
real<lower=0> sigmaEffect; // Standard deviation of stimulation effect
real<lower=0> sigmaResidual; // Standard deviation of residuals
}
model {
// Priors
meanBaseline ~ normal(0, 30);
sigmaBaseline ~ normal(0, 5);
neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
meanEffect ~ normal(0, 30);
sigmaEffect ~ normal(0, 5);
neuronEffect ~ normal(meanEffect, sigmaEffect);
sigmaResidual ~ normal(0, 5);
vector[N] mu = neuronBaseline[neuronId] + Stimulation .* neuronEffect[neuronId];
Response ~ normal(mu, sigmaResidual);
}
This is the parallelized model version:
functions {
vector partial_sum(vector popParams, vector neuronParams, real[] xr, int[] xi) {
// Unpack population-level parameters
real meanBaseline = popParams[1];
real meanEffect = popParams[2];
real sigmaBaseline = popParams[3];
real sigmaEffect = popParams[4];
real sigmaResidual = popParams[5];
// Unpack neuron-level parameters
real neuronBaseline = neuronParams[1];
real neuronEffect = neuronParams[2];
// Log density of the neuron-level parameters given the population parameters
real lp1 = normal_lpdf(neuronBaseline | meanBaseline, sigmaBaseline);
real lp2 = normal_lpdf(neuronEffect | meanEffect, sigmaEffect);
// Compute log-likelihood of data
real ll = normal_lpdf(xr | neuronBaseline + neuronEffect * to_vector(xi), sigmaResidual);
return [lp1 + lp2 + ll]';
}
}
data {
int<lower=1> nNeurons; // Number of groups
int<lower=1> N; // Number of total datapoints
array[N] int<lower=0, upper=1> Stimulation; // Stimulation indicator
array[N] real Response; // Response data
array[N] int<lower=1> neuronId; // Neuron indices for each observation
}
transformed data {
int<lower=0> J = N / nNeurons; // Observations per neuron (one map_rect shard)
array[nNeurons, J] real xr;
array[nNeurons, J] int<lower=0, upper=1> xi;
{
int pos = 1;
for (i in 1:nNeurons) {
int end = pos + J - 1;
xr[i] = to_array_1d(Response[pos:end]);
xi[i] = Stimulation[pos:end];
pos = pos + J;
}
}
}
parameters {
real meanBaseline; // Average baseline FR
real meanEffect; // Increment of FR with stimulation
vector[nNeurons] neuronBaseline; // Neuron-specific baseline FR
vector[nNeurons] neuronEffect; // Neuron-specific effect of stimulation
real<lower=0> sigmaBaseline; // Standard deviation of baseline FR
real<lower=0> sigmaEffect; // Standard deviation of stimulation effect
real<lower=0> sigmaResidual; // Standard deviation of residuals
}
transformed parameters {
vector[5] popParams;
array[nNeurons] vector[2] neuronParams;
for (i in 1:nNeurons) {
neuronParams[i] = [neuronBaseline[i], neuronEffect[i]]';
}
popParams = [meanBaseline, meanEffect, sigmaBaseline, sigmaEffect, sigmaResidual]';
}
model {
// Priors
meanBaseline ~ normal(0, 30);
sigmaBaseline ~ normal(0, 5);
neuronBaseline ~ normal(meanBaseline, sigmaBaseline);
meanEffect ~ normal(0, 30);
sigmaEffect ~ normal(0, 5);
neuronEffect ~ normal(meanEffect, sigmaEffect);
sigmaResidual ~ normal(0, 5);
// Likelihood
target += sum(map_rect(partial_sum, popParams, neuronParams, xr, xi));
}
And the Python code to generate the dataset and fit the models:
import numpy as np
import pandas as pd
from cmdstanpy import CmdStanModel
import os
import time
###############
# 1) GENERATE THE DATA
###############
# Dataset parameters
nTrials = 50 # number of trials per condition
nNeurons = 200 # number of neurons connected to the target
meanBaseline = 10 # mean baseline firing rate
sigmaBaseline = 3 # variability in baseline firing rate
meanEffect = 0.5 # mean effect of stimulation on the response
sigmaEffect = 0.1 # variability of the effect between neurons
sigmaResidual = 3 # residual variability across trials
neuronBaselines = np.random.normal(0, sigmaBaseline, nNeurons) + \
meanBaseline
neuronEffects = np.random.normal(0, sigmaEffect, nNeurons) + \
meanEffect
# Generate the dataset
data = []
for neuron in range(nNeurons):
for stimulation in [0, 1]:
samples = np.random.normal(neuronBaselines[neuron] +
neuronEffects[neuron] * stimulation,
sigmaResidual, nTrials)
data.extend(zip([neuron] * nTrials,
range(1, nTrials + 1),
[stimulation] * nTrials, samples))
# Create a DataFrame from the generated data
expData = pd.DataFrame(data, columns=['Neuron', 'Trial',
'Stim', 'Response'])
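# Optional sanity check: the parallel model's transformed data block slices
# Response/Stimulation into contiguous chunks of 2*nTrials rows per neuron,
# so rows must be grouped by neuron; the loop above produces exactly that order.
assert np.array_equal(expData['Neuron'].values,
                      np.repeat(np.arange(nNeurons), 2 * nTrials))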
###############
# 2) FIT MODELS
###############
nChains = 1
# Prepare the data
stan_data = {
'nNeurons': nNeurons,
'N': nNeurons * nTrials * 2,
'Stimulation': expData['Stim'].values,
'Response': expData['Response'].values,
'neuronId': expData['Neuron'].values + 1,
}
### FIT STANDARD MODEL
stan_file = './stan_models/standard.stan'
model = CmdStanModel(stan_file=stan_file)
start = time.time()
posterior = model.sample(data=stan_data, chains=1)
end = time.time()
print(f'Elapsed (standard): {end - start}')
### FIT PARALLELIZED MODEL
threads = 4
stan_file = './stan_models/parallel.stan'
model = CmdStanModel(stan_file=stan_file,
cpp_options={'STAN_THREADS': True})
os.environ['STAN_NUM_THREADS'] = str(threads)
start = time.time()
posterior = model.sample(data=stan_data, chains=1,
threads_per_chain=threads, show_progress=True)
end = time.time()
print(f'Elapsed (parallel): {end - start}')