Proof of concept: Binary output format for cmdstan

sakrejda · April 7, 2026, 5:03pm

Sure and your point was that we should focus on working with the current writer interface. I’m sure I could get something on par with R relatively quickly, including tuples, having done most of it previously. The main issue is the team deciding where to put that sort of thing in the current code base… which has always been the main barrier.

WardBrian · April 7, 2026, 5:04pm

Was there a PR resulting from your previous arrow work?

avehtari · April 7, 2026, 5:09pm

tuples seem to work fine with CmdStanR and posterior. I tested with following Stan file

parameters {
  real a_scalar;
  array[2] real b_array;
  matrix[2,2] c_matrix;
}
model {
  a_scalar ~ normal(0, 1);
  b_array ~ normal(0, 1);
  to_vector(c_matrix) ~ normal(0, 1);
}
generated quantities {
  tuple(real, array[2] real, matrix[2,2]) d_tuple = (a_scalar, b_array, c_matrix);
  complex z = b_array[1] + b_array[2] * 1i;
}

Running in R

library(cmdstanr)
library(posterior)
mod <- cmdstan_model("tuple_test.stan")
fit <- mod$sample()
dr <- fit$draws()

and then three examples of formats

> variables(dr)
 [1] "lp__"           "a_scalar"       "b_array[1]"     "b_array[2]"     "c_matrix[1,1]" 
 [6] "c_matrix[2,1]"  "c_matrix[1,2]"  "c_matrix[2,2]"  "d_tuple:1"      "d_tuple:2[1]"  
[11] "d_tuple:2[2]"   "d_tuple:3[1,1]" "d_tuple:3[2,1]" "d_tuple:3[1,2]" "d_tuple:3[2,2]"
[16] "z[real]"        "z[imag]"

> as_draws_rvars(dr)
# A draws_rvars: 1000 iterations, 4 chains, and 8 variables
$lp__: rvar<1000,4>[1] mean ± sd:
[1] -3.5 ± 1.9 

$a_scalar: rvar<1000,4>[1] mean ± sd:
[1] 0.00056 ± 1 

$b_array: rvar<1000,4>[2] mean ± sd:
[1]  0.014 ± 0.99  -0.015 ± 0.98 

$c_matrix: rvar<1000,4>[2,2] mean ± sd:
     [,1]          [,2]         
[1,] -0.00017 ± 1   0.00309 ± 1 
[2,] -0.02284 ± 1   0.00932 ± 1 

$d_tuple:1: rvar<1000,4>[1] mean ± sd:
[1] 0.00056 ± 1 

$d_tuple:2: rvar<1000,4>[2] mean ± sd:
[1]  0.014 ± 0.99  -0.015 ± 0.98 

$d_tuple:3: rvar<1000,4>[2,2] mean ± sd:
     [,1]          [,2]         
[1,] -0.00017 ± 1   0.00309 ± 1 
[2,] -0.02284 ± 1   0.00932 ± 1 

$z: rvar<1000,4>[2] mean ± sd:
          real           imag 
 0.014 ± 0.99  -0.015 ± 0.98

> extract_list_of_variable_arrays(dr) |> str()
List of 8
 $ lp__     : num [1:1000, 1:4, 1] -4.28 -5.06 -4.23 -0.88 -1.36 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ a_scalar : num [1:1000, 1:4, 1] 1.642 1.887 0.919 0.349 -0.539 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ b_array  : num [1:1000, 1:4, 1:2] 0.5982 -0.0339 -0.5241 -0.1028 0.1982 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ c_matrix : num [1:1000, 1:4, 1:2, 1:2] 0.627 -1.135 -1.783 0.674 -1.062 ...
  ..- attr(*, "dimnames")=List of 4
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ d_tuple:1: num [1:1000, 1:4, 1] 1.642 1.887 0.919 0.349 -0.539 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ d_tuple:2: num [1:1000, 1:4, 1:2] 0.5982 -0.0339 -0.5241 -0.1028 0.1982 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ d_tuple:3: num [1:1000, 1:4, 1:2, 1:2] 0.627 -1.135 -1.783 0.674 -1.062 ...
  ..- attr(*, "dimnames")=List of 4
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : NULL
 $ z        : num [1:1000, 1:4, 1:2] 0.5982 -0.0339 -0.5241 -0.1028 0.1982 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : NULL
  .. ..$ : NULL
  .. ..$ : chr [1:2] "real" "imag"

EDIT: added complex type

WardBrian · April 7, 2026, 5:12pm

My guess is it will behave slightly less well if you have an array of tuples or nested tuples, but perhaps it has been improved since I last tried!

sakrejda · April 7, 2026, 5:13pm

No, we never arrived at an answer about where to integrate that sort of thing. This links to the last time we talked about it: Usage of Arrow with Stan

First post there links to the code, main event is in stan/src/stan/callbacks/arrow_ipc_writer.hpp at develop-with-arrow · sakrejda/stan · GitHub

My main point is that deciding where something like this gets integrated needs to get addressed

avehtari · April 7, 2026, 5:18pm

Give me a Stan file and I can test. Also added complex to the above example

spinkney · April 7, 2026, 6:15pm

Aki try

parameters {
  real a_scalar;
  tuple(tuple(array[2] real, vector[2]), matrix[2, 2]) b_tuple;
  matrix[2,2] c_matrix;
}
model {
  a_scalar ~ normal(0, 1);
  b_tuple.1.1 ~ normal(0, 1);  // array sized 2
  b_tuple.1.2 ~ normal(0, 1);  // vector sized 2
  to_vector(b_tuple.2) ~ normal(0, 1); // matrix sized 2 x 2
  to_vector(c_matrix) ~ normal(0, 1);
}
generated quantities {
  tuple(real, tuple(tuple(array[2] real, vector[2]), matrix[2, 2]), matrix[2,2]) d_tuple = (a_scalar, b_tuple, c_matrix);
  complex z = b_tuple.1.1[1] + b_tuple.1.1[2] * 1i;
}

After fitting I get the warning

Warning messages:
1: In variable_dims(metadata$variables) : NAs introduced by coercion
2: In variable_dims(metadata$variables) : NAs introduced by coercion

WardBrian · April 7, 2026, 6:20pm

This is a subset of the models we test the output of in stanio’s tests:

parameters {
  real mu;
}
model {
  mu ~ normal(0, 1);
}
generated quantities {
  matrix[2, 3] m = mu * to_matrix(linspaced_vector(6, 5, 11), 2, 3);
  array[4] matrix[2, 3] threeD;
  for (i in 1 : 4) {
    threeD[i] = i * mu * to_matrix(linspaced_vector(6, 5, 11), 2, 3);
  }
  
  real nu = normal_rng(mu, 1);
  complex z = nu + nu * 2.0i;
  complex_vector[2] zv = to_complex([3 * nu, 5 * nu]', [nu * 4, nu * 6]');
  complex_matrix[2, 3] zm = to_complex(m, m + 1);
  array[4] complex_matrix[2, 3] z3D;
  for (i in 1 : 4) {
    z3D[i] = to_complex(threeD[i], threeD[i] + 1);
  }
  
  real base = normal_rng(0, 1);
  int base_i = to_int(normal_rng(10, 10));
  
  tuple(real, real) pair = (base, base * 2);
  
  tuple(real, tuple(int, complex)) nested = (base * 3, (base_i, base * 4.0i));
  array[2] tuple(real, real) arr_pair = {pair, (base * 5, base * 6)};
}

Besides just testing that the naming is correct, it’s also worth testing that the correct pieces are labeled as the real/imaginary parts. cmdstanpy had a bug in its initial handling of these

avehtari · April 7, 2026, 7:29pm

Both additional Stan examples seem to work fine

> variables(dr2)
 [1] "lp__"             "a_scalar"         "b_tuple:1:1[1]"   "b_tuple:1:1[2]"   "b_tuple:1:2[1]"  
 [6] "b_tuple:1:2[2]"   "b_tuple:2[1,1]"   "b_tuple:2[2,1]"   "b_tuple:2[1,2]"   "b_tuple:2[2,2]"  
[11] "c_matrix[1,1]"    "c_matrix[2,1]"    "c_matrix[1,2]"    "c_matrix[2,2]"    "d_tuple:1"       
[16] "d_tuple:2:1:1[1]" "d_tuple:2:1:1[2]" "d_tuple:2:1:2[1]" "d_tuple:2:1:2[2]" "d_tuple:2:2[1,1]"
[21] "d_tuple:2:2[2,1]" "d_tuple:2:2[1,2]" "d_tuple:2:2[2,2]" "d_tuple:3[1,1]"   "d_tuple:3[2,1]"  
[26] "d_tuple:3[1,2]"   "d_tuple:3[2,2]"   "z[real]"          "z[imag]"

variables(dr3)
  [1] "lp__"             "mu"               "m[1,1]"           "m[2,1]"           "m[1,2]"          
  [6] "m[2,2]"           "m[1,3]"           "m[2,3]"           "threeD[1,1,1]"    "threeD[2,1,1]"   
 [11] "threeD[3,1,1]"    "threeD[4,1,1]"    "threeD[1,2,1]"    "threeD[2,2,1]"    "threeD[3,2,1]"   
 [16] "threeD[4,2,1]"    "threeD[1,1,2]"    "threeD[2,1,2]"    "threeD[3,1,2]"    "threeD[4,1,2]"   
 [21] "threeD[1,2,2]"    "threeD[2,2,2]"    "threeD[3,2,2]"    "threeD[4,2,2]"    "threeD[1,1,3]"   
 [26] "threeD[2,1,3]"    "threeD[3,1,3]"    "threeD[4,1,3]"    "threeD[1,2,3]"    "threeD[2,2,3]"   
 [31] "threeD[3,2,3]"    "threeD[4,2,3]"    "nu"               "z[real]"          "z[imag]"         
 [36] "zv[1,real]"       "zv[1,imag]"       "zv[2,real]"       "zv[2,imag]"       "zm[1,1,real]"    
 [41] "zm[1,1,imag]"     "zm[2,1,real]"     "zm[2,1,imag]"     "zm[1,2,real]"     "zm[1,2,imag]"    
 [46] "zm[2,2,real]"     "zm[2,2,imag]"     "zm[1,3,real]"     "zm[1,3,imag]"     "zm[2,3,real]"    
 [51] "zm[2,3,imag]"     "z3D[1,1,1,real]"  "z3D[1,1,1,imag]"  "z3D[2,1,1,real]"  "z3D[2,1,1,imag]" 
 [56] "z3D[3,1,1,real]"  "z3D[3,1,1,imag]"  "z3D[4,1,1,real]"  "z3D[4,1,1,imag]"  "z3D[1,2,1,real]" 
 [61] "z3D[1,2,1,imag]"  "z3D[2,2,1,real]"  "z3D[2,2,1,imag]"  "z3D[3,2,1,real]"  "z3D[3,2,1,imag]" 
 [66] "z3D[4,2,1,real]"  "z3D[4,2,1,imag]"  "z3D[1,1,2,real]"  "z3D[1,1,2,imag]"  "z3D[2,1,2,real]" 
 [71] "z3D[2,1,2,imag]"  "z3D[3,1,2,real]"  "z3D[3,1,2,imag]"  "z3D[4,1,2,real]"  "z3D[4,1,2,imag]" 
 [76] "z3D[1,2,2,real]"  "z3D[1,2,2,imag]"  "z3D[2,2,2,real]"  "z3D[2,2,2,imag]"  "z3D[3,2,2,real]" 
 [81] "z3D[3,2,2,imag]"  "z3D[4,2,2,real]"  "z3D[4,2,2,imag]"  "z3D[1,1,3,real]"  "z3D[1,1,3,imag]" 
 [86] "z3D[2,1,3,real]"  "z3D[2,1,3,imag]"  "z3D[3,1,3,real]"  "z3D[3,1,3,imag]"  "z3D[4,1,3,real]" 
 [91] "z3D[4,1,3,imag]"  "z3D[1,2,3,real]"  "z3D[1,2,3,imag]"  "z3D[2,2,3,real]"  "z3D[2,2,3,imag]" 
 [96] "z3D[3,2,3,real]"  "z3D[3,2,3,imag]"  "z3D[4,2,3,real]"  "z3D[4,2,3,imag]"  "base"            
[101] "base_i"           "pair:1"           "pair:2"           "nested:1"         "nested:2:1"      
[106] "nested:2:2[real]" "nested:2:2[imag]" "arr_pair[1:1]"    "arr_pair[1:2]"    "arr_pair[2:1]"   
[111] "arr_pair[2:2]"

I only eyeballed the names.

I did get that also with my example, but ignored it as I’m used to seeing these when there are constant variables. I’ll make an issue.

I’m also curious to see a real model using nested tuples

spinkney · April 7, 2026, 7:51pm

I find it confusing to use nested tuples unless we had structs. Today we use numbers to index into the tuple but being able to name the element within the tuple would make it easier to understand which element is being accessed.

In the above to access the array[2] real from tuple(tuple(array[2] real, vector[2]), matrix[2, 2]) b_tuple; I had to call b_tuple.1.1. With structs we could name the parts of the tuple and use the name to access.

 tuple(tuple(array[2] real, vector[2]), matrix[2, 2]) b_tuple; // current way

 tuple(tuple(array[2] real b_array, vector[2] b_vector) b_in, matrix[2, 2]) b_tuple;

...
 // access the first element of the array[2] real named b_array
 real x = b_tuple.b_in.b_array[1];

Brian probably knows of other (better) ways to write this and access these

avehtari · April 7, 2026, 7:51pm

There already was a CmdStanR issue NA dimensions for complex and tuple types · Issue #925 · stan-dev/cmdstanr · GitHub, but apparently posterior package can still handle complex and tuple types

WardBrian · April 7, 2026, 8:52pm

If you ask for this element, what do you actually get?

andre.pfeuffer · April 8, 2026, 5:11am

Could we include the data and transformed_data in the binary output format? Would help not only for debugging models.

just my 2 cents.

avehtari · April 8, 2026, 6:41am

Depends on which format you ask. The default is draws_array

> dr3 |> subset_draws(variable="arr_pair[2:2]")
# A draws_array: 1000 iterations, 4 chains, and 1 variables
, , variable = arr_pair[2:2]

         chain
iteration    1     2    3     4
        1 0.67 -5.54  3.6  5.28
        2 3.54  0.92 -8.3 -0.86
        3 1.19  0.63 -8.0 -2.61
        4 3.33  3.61  3.2  4.07
        5 7.79  4.12  8.4 -2.93

But you can also ask draws_df, draws_matrix, draws_rvars, numeric, array or matrix (showing just two of these)

> dr3 |> as_draws_df() |> subset_draws(variable="arr_pair[2:2]")
# A draws_df: 1000 iterations, 4 chains, and 1 variables
   arr_pair[2:2]
1           0.67
2           3.54
3           1.19
4           3.33
5           7.79
6          -0.57
7          -6.35
8           3.45
9          -1.61
10         -2.27
# ... with 3990 more draws
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

> dr3 |> as_draws_rvars() |> subset_draws(variable="arr_pair")
# A draws_rvars: 1000 iterations, 4 chains, and 1 variables
$arr_pair: rvar<1000,4>[4] mean ± sd:
         1:1          1:2          2:1          2:2 
-0.0069 ± 1  -0.0138 ± 2  -0.0346 ± 5  -0.0415 ± 6

WardBrian · April 8, 2026, 1:33pm

There’s an open feature request to give access to the transformed data, but at the moment the rest of Stan outside the model can’t really access it

Topic		Replies	Views
Alternative .csv reader Developers maintenance , rstan	53	4133	September 11, 2018
JSON Output for STAN Developers	30	1184	October 3, 2023
Speed difference between rstan and cmdstan for a simple model CmdStan rstan , techniques	25	3772	November 7, 2021
Notes on Stan Output Serialization Options (YAML, Protobuf, Avro, CBOR) Developers	13	3312	July 14, 2021
CmdStan dataset loading speed Developers cmdstan	21	1455	February 14, 2020

Proof of concept: Binary output format for cmdstan

Related topics