Sharing fitted brms objects after removing data

Question Context
This is a fairly open-ended question about best practices for open science with restricted third-party data. Because of the nature of the data used for this study, I cannot share the raw data directly (it must be requested through the third party), but I still want to share as much of my data analysis as possible. I will include the code used to convert the raw datafile into the data used in my study, so anyone with access to the data should be able to repeat all of my descriptives, analyses, etc. Since I have all the brms models saved, I would ideally share these fitted objects; however, these objects store the data used to fit the models.

I believe that this can be fairly easily resolved with the following:

modFit$data <- NULL

Questions

  1. Does this assignment successfully remove all traces of the data from the object? I know there is a plan to clean up some redundancy in brmsfit objects, but it doesn't seem like any additional copy of the data is stored (at least per this).

  2. What model validations does removing the stored data prevent? I'm fairly sure that anything requiring predictions will be impossible (e.g., pp_check(), residuals(), etc.).

  3. Without the ability to perform certain inspections of the models, is it worthwhile to share the objects at all? I declared a seed in R and in the call to brm(..., seed = ###), so I'm hoping that rerunning the models in another R session/environment should yield very similar results anyway (but obviously at the computational and temporal cost of someone else having to re-run the models).

  4. As an alternative to providing the entire brms fitted model as a .rds file, would it perhaps be more useful to share just the posteriors? I'll be providing supplementary material with diagnostics, fit results, model comparisons, etc., so the whole model may be unnecessary for 99% of readers (the remaining 1% could then theoretically run the models themselves using the provided R scripts). It would seem to me that the only thing someone would want for future research is the posteriors, but I may be overlooking something.

I appreciate any feedback and readings on navigating open science with R objects that store publicly available but restricted-access data, particularly as it relates to Bayesian results, since existing posteriors have implications for future priors.
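For question 4, here is a minimal sketch of what sharing just the posteriors could look like. With an actual brmsfit one would call `posterior::as_draws_df(fit)`; since the fit itself can't be shown here, a plain data frame with hypothetical parameter names stands in for the draws:

```r
# In practice: draws <- posterior::as_draws_df(fit)
# A plain data frame with made-up parameter names stands in here.
draws <- data.frame(b_Intercept = rnorm(100), sigma = abs(rnorm(100)))

# Save only the draws, not the model or the data, as the shareable artifact.
path <- file.path(tempdir(), "posterior_draws.rds")
saveRDS(draws, path)

# A reader can then reload the draws without the fitted object:
reloaded <- readRDS(path)
```

This keeps the marginal and joint posterior available for reuse (e.g., as a basis for future priors) while sharing nothing about the raw observations beyond what the posterior itself encodes.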


Another option is to create simulated data that has the same statistical properties and share that. I've done this on a few projects where the data were confidential but folks wanted to investigate the workflow and modeling choices.
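As a crude base-R sketch of that idea (the variable names are hypothetical, and this only matches the marginal means and SDs, ignoring correlations, which is where a dedicated synthesis tool does better):

```r
# `real` stands in for the confidential dataset.
real <- data.frame(age = rnorm(200, 45, 12), score = rnorm(200, 100, 15))

# Simulate a shareable dataset of the same size with similar marginals.
sim <- data.frame(
  age   = rnorm(nrow(real), mean(real$age), sd(real$age)),
  score = rnorm(nrow(real), mean(real$score), sd(real$score))
)
```

The simulated `sim` can be shared alongside the analysis scripts so others can run the full workflow end to end.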


@Ara_Winter, I like the idea of simulating data as a way of helping people play with the model and figure out what is being done under the hood.

I have been thinking about this topic for the last week, and I wonder whether anyone has experience including the models as part of a Shiny + Markdown format. I don't have enough technical knowledge to know whether objects included as part of a Markdown document can be accessed afterward from the .html file, but I would think it should be possible to create an interactive Markdown document where reviewers/readers could select a model from a dropdown and then choose the diagnostics or summaries they want to see.

In a practical sense, this would allow access to a selection of the nice features of brms for model summaries and diagnostics without giving up the actual R object, thus avoiding the need to null out the raw data. I do worry that the final file would be quite large when many models are examined on decently sized datasets. Plus, I don't know enough about how these kinds of files work to know whether the raw data might be vulnerable to extraction.


Shiny can do almost anything, so it can definitely also include a model. On the other hand, pure Markdown can't easily share arbitrary files (I don't doubt you could do all sorts of JavaScript hacks to make it happen, but it is IMHO not directly available).

For sharing the full fit, you would need to erase at least the $data member (and possibly $data2 if you use it). This way, only limited information will leak directly (e.g., the names associated with all your grouping levels and factor levels). In most cases, I think it would be impossible to reconstruct the dataset from the fit. A possible risk, IMHO, is that if the number of model parameters is not substantially lower than the number of data points, there might be (at least in theory) enough information for some reconstruction. If you have a few parameters and a lot of data, you are most likely safe. If this is highly sensitive data, I would investigate the literature on this topic a bit more.

The problem with just erasing the data is that any feature of brms that requires prediction under the hood (e.g., posterior-predictive checks, conditional effects, ...) will not work. Users can only inspect the fitted parameters (which might be enough, depending on the use case). So if you also need PP checks or similar, you IMHO either need to generate the summaries once or have an app that keeps the underlying model (and thus the data) private. But once you allow the user to run, say, a bunch of posterior-predictive checks, they will also be able to access some summaries of the data (as those are immediately visible in a PP check). If you allow too many summaries to be generated, you once again run the risk of an attacker reconstructing parts of the dataset from the summaries. I think this is a real risk only if you really allow a lot of summaries, but the point is that you can't just let the user do whatever they please with the data.

Also note that there is shinystan, which lets users inspect Stan (and brms) fits in a Shiny app, so exposing a shinystan instance might be a good and easy way to expose almost arbitrary model summaries. (I would still remove the data from the fit beforehand, so it is never even uploaded to the server, just in case.)

Hope that helps at least a bit.


Could be helpful,
https://synthpop.org.uk


Hello community!
I am facing the same (or almost the same) issue.
I want to make my brms model available, but not the training data, which comes from a third party.
I want to allow simple use of the model as a predictive tool without sharing the original data.
And honestly, I do not understand why one cannot do a simple predict() on new data after $data <- NULL.
I get an "Error in eval(predvars, data, env) : object 'X1' not found", X1 being the name of the first predictor variable.

Does someone have a solution for this? And perhaps an explanation for why prediction with newdata is not possible?
Thank you very much!

Have you tried something like:

m$data$x1 <- NA
m$data$x2 <- NA
m$data$y <- NA
new_data <- cbind.data.frame(x1=0.5, x2=0.3)
predict(m, newdata=new_data)

That works for me. That way you remove the actual data but keep the column names and data.frame structure.
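A small generalization of this trick, as a sketch (`scrub_data` is a hypothetical helper, not part of brms): NA out every column of the stored data frame while keeping the column names, classes, and factor levels intact, which is the structure the prediction machinery needs:

```r
# Hypothetical helper: blank all values in a data frame while preserving
# column names, classes, and factor levels.
scrub_data <- function(df) {
  df[] <- lapply(df, function(col) {
    col[] <- NA  # subassignment keeps the column's class and attributes
    col
  })
  df
}

d <- data.frame(x1 = rnorm(5), f = factor(c("a", "b", "a", "b", "a")))
d_scrubbed <- scrub_data(d)
```

Under the same assumption, something like `m$data <- scrub_data(m$data)` should keep enough structure for `predict(m, newdata = ...)` to build its model frame, while the actual observed values are gone.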


Nice idea! It worked!
One addition: since I have factor variables, I needed to keep the contrasts attribute.
Thank you!
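For anyone hitting the same thing, here is a base-R illustration of why that matters (a sketch, independent of brms): replacing a whole factor column with `NA` turns it into a plain logical column and drops its levels and contrasts, while subassigning with `[] <-` blanks the values but keeps the factor metadata:

```r
d1 <- data.frame(f = factor(c("a", "b", "a")))
d2 <- d1

d1$f <- NA    # column becomes plain logical NA: levels and contrasts are lost
d2$f[] <- NA  # values blanked, but the factor class and its levels survive

is.factor(d1$f)  # no longer a factor
is.factor(d2$f)  # still a factor
levels(d2$f)     # levels are preserved
```

With the second form, the factor's levels (and contrasts) remain available to the prediction machinery even though no actual observations are left.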
