Adding coordinates and imd to arviz inference object after cmdstanpy

I’m trying to add coordinates and dimensions to an arviz inference.data object after sampling with cmdstanpy.

I need to do this in order to be able to subset the samples easily and feed them to arviz posterior plots and predictive checks.

I can successfully add one single dimension per group but no more than that. Here is some code to reproduce the problem. I’m using the eight school example: imagine that each school belongs to district and implements a certain curriculum, both indexed as either 1 or 2.

I want to be able to slice the inference.data object according to either district or curriculum.

Starting with code to run a model

from cmdstanpy import CmdStanModel
import numpy as np
import arviz as az
import pandas as pd

# Model specification and compilation
model_code = """
    data {
        int<lower=0> N;
        vector[N] y;
    }
    parameters {
        real<lower=0> theta;
    }
    model {
        theta ~ normal(0, 1); 
        y ~ normal(theta, 1);
    }
    generated quantities{
       vector[N] log_lik;
       vector[N] y_hat;
        for (j in 1:N) {
            log_lik[j] = normal_lpdf(y[j] | theta, 1);
            y_hat[j] = normal_rng(theta, 1);

        }
    }
"""
with open("./eight_school_check.stan", "w") as f:
    print(model_code, file=f)
stan_file = "./eight_school_check.stan"
stan_model = CmdStanModel(
    stan_file=stan_file,
    cpp_options={'STAN_THREADS': 'TRUE'}
)
stan_model.compile()

# Data
eight_school_data = {
    "N": 8,
    "y": [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
}

# Model fit
stan_fit = stan_model.sample(data=eight_school_data)

Then, if I try to convert the fitted results and add coordinates and dimensions

additional_indexes = {
    "ind_district" : np.array([0, 0, 0, 1, 1, 1, 0, 1]),
    "ind_curriculum" : np.array([0, 1, 0, 1, 0, 1, 0, 1]),
}

inference_data = az.from_cmdstanpy(
    posterior=stan_fit,
    posterior_predictive='y_hat',
    observed_data={'y' : eight_school_data['y']},
    log_likelihood='log_lik',
    coords={
        'district': additional_indexes['ind_district'],
        'curriculum': additional_indexes['ind_curriculum'], 
    },
    dims={
        'y' :  ['district', 'curriculum'],
        'y_hat' : ['district', 'curriculum'],
        'log_lik' : ['district', 'curriculum'], 
    }
)

which gives me the following error

different number of dimensions on data and dims: 3 vs 4

However, I am able to add one of the two dimensions using:

inference_data = az.from_cmdstanpy(
    posterior=stan_fit,
    posterior_predictive='y_hat',
    observed_data={'y' : eight_school_data['y']},
    log_likelihood='log_lik',
    coords={
        'district': additional_indexes['ind_district'],
        'curriculum': additional_indexes['ind_curriculum'], 
    },
    dims={
        'y' :  ['district'],
        'y_hat' : ['curriculum'],
        'log_lik' : ['district'], 
    }
)

this runs with no error, and I can successfully run commands like

inference_data.sel(distict=[1]) # See coords under posterior_predictive
inference_data.sel(curriculum=[1]) # See coords under log_likelihood

again, ideally I want to be able to run a command such as

inference_data.sel(curriculum=[1], district=[2]) 

I hope I was able to explain myself clearly enough. Thanks a lot in advance!

I believe this is actually just a question about xarray (which arviz uses in their data model), so you may be better off asking in that community, but I think the answer is you can add multiple coordinates to the same dimension but can’t use more than one for selection (basically, one of them is always the “index”)

You might be able to do whatever you ultimately want with something like a group_by operation, it depends

1 Like

Thanks! I should be able to get around this by using one index for the object and an external reference to do more complicated selection

Looks to me like there is some confusion about the dimensions in your model. I don’t think you have anything in your model with a district or curriculum dimension, instead you have a observation dimension.
The ind_district and ind_curriculum values aren’t dimensions either, I’d think of them as datasets with a observation dimension that indicate which district or curriculum each observation came from.
Only if you have a district_effect that has length num_districts would your model contain a value with dimension district.