Sampling from a matrix in STAN

Hi, all stan users and experts,

I am dealing with some issues when coding a program using stan. In my stan code, I need to take random data samples (data raws) from the generated data matrix relevant to the likelihood function. But, I don’t know how to do that. Can anyone help me here?

Thank you

I think there’s some ambiguity in what you are trying to do. Are you able to share a bit more about the model? Note that the likelihood, as a function of the data and the parameters must be completely deterministic in a Stan program. Random-number generation is restricted to the transformed data and generated quantities blocks of a Stan program.

Hi @jsocolar,

Thanks for getting back to me. Okay, I will try to explain in this way.

I am trying to use stan in the Bayesian calibration of the microsimulation lung cancer model and here, the likelihood function follows poison distribution. At this point, I need to figure out how many individuals at each disease state with particular age groups. What I am trying to do is simulate lung cancer data and then do sampling to find summary distributions.

Does that make sense for you? Are there any different kinds of approaches for doing this?

Thank you

It sounds like maybe you want to repeatedly simulate datasets under known parameters and then fit a model to those datasets using Stan. Is that an accurate description? In general, the way to do this is to simulate the data externally to Stan and then fit the model using Stan.

That’s great. Ultimately, I need to end up with calibrated parameters.

But, I don’t know how to link data that is simulated externally with stan to calibrate parameters. If you have any idea please kindly share with me.

Thank you

So I should correct myself a little bit. The dataset simulation can be done inside a Stan program, but if you aren’t comfortable setting that up then the most straightforward and familiar way to achieve this will be to simulate the data externally.

For an overview of how to do it all inside the Stan program (maybe you’ve already seen this), consult the Stan User’s Guide here: 26.3 SBC in Stan | Stan User’s Guide

To do it outside of the Stan program, just simulate your data in R (or Python etc, but I see from your previous posts that you use R) and then fit the Stan model to those data using rstan or cmdstanr.

First, thank you @jsocolar ,

Yes, I am using both rstan and cmdstanr (but, I would prefer to work with cmdstanr). Here, I will follow SBC in stan.

One thing, let’s say there are 100000 individual simulations in my data table, and a random sample of 1000 individuals is needed to be sampled. What stan command can be used for that?

Thank you

I’m sorry; I don’t understand what you are trying to achieve. Can you provide some code to illustrate what you are trying to achieve and where you are getting hung up?

In what data table?

For what purpose?

Sorry for the inconveniences caused to you.

This is the paper I am following.

Okay, lets’ say like that, first, I need to simulate 1000 micro-simulations as individual-level data, then, I need to make samples of size 100 from these micro-simulations in order to count how many individuals are in different disease states with time (different independent samples for each disease states)
Also I need to repeat these steps 10 times(say) using a for loop.

Does that make sense for you? :(

Thank you.

I think I still don’t understand. What you describe above can be achieved by running 1000 micro-simulations using the MILC package mentioned in the abstract of the paper you linked, then randomly selecting 100 of them (but why then did you bother to simulate the other 900?) with something like base::sample(1:1000, 100), and wrapping this whole process in a for (i in 1:10) {...} loop. The end result of this process would be that you would have 1000 microsimulations sitting saved in your R environment.

Clearly this isn’t what you’re aiming for. If it were, you would just run the microsimulation 1000 times and be done with it. Since you’re talking about doing this with Stan, I imagine that at some point you want to fit a statistical model to the output of the simulations. And this is still the missing piece in your description of the problem. What is that Stan model? To answer this question, don’t worry about the part of the Stan program that simulates the data; pretend like you have a single fixed dataset. What does that dataset look like? What model are you trying to fit to those data? How do you write the model in Stan (provide code)? I think we’ll be able to move forward if we can get ourselves on the same page about what the model you’re working with looks like.

And don’t worry about any inconvenience to me; I answer questions here because I enjoy it.

2 Likes

Hi @jsocolar ,

I much appreciated your ideas.
I hope I can solve the issue in my code along with what you have suggested to me. I am doing manual coding rather than using MILC package and also this a small part of my work. So, it is little bit hard to explain all here.

Will back here if i have further coding errors.

Thank you

Hi @jsocolar,

I am here again.
I need to find the inverse of CDF of equation (1) described in the attached paper above.
Can I use inverse() for that?

Any idea is highly appreciated.

Thank you

No, inverse in Stan is for inverting a matrix, not for finding the inverse of a function. Your best bet, I’d guess, is to plug the function into Wolfram and ask it for the inverse. Note that I don’t have access to the paper, and can’t see equation 1, so it’s possible that there are better options depending on the function itself.

See the equation (1) here

ijm-00164.pdf (1.4 MB)

Hi @jsocolar ,

I am weird with some situations when going to define functions for random number generation. See this function below, and I need to use this y in user defined function block (i.e. in my likelihood). As it is not possible to use _rng in functions block, where should I accurately define this?

real fun1(real x1; real x2){ 
    real y = exponential_rng(1/83.5);
    if(x1 < y && y < x2){
      return(y);
    } else {
      fun1(x1,x2);
    }
  }

Thank you.

You cannot use random number generation anywhere in the likelihood. The likelihood must be a deterministic function of the parameters and data.

What distribution do you need the inverse cdf for? The distribution of failure times implied by that function? In any case, either Wolfram can give you a usable inverse cdf or it can’t. If it can’t, then I definitely won’t have anything to add :/

Thank you.
Okay, even though random numbers can be generated in transformed parameter block or generated quantity block, there are no way to generate it using a function as I mentioned in above R code?

Random numbers cannot be generated in the transformed parameter block, just in transformed data and generated quantities. The key principle here is that the likelihood must be a deterministic function of the parameters and the data/transformed data. For more see the Stan Users Guide section here: 18.4 Functions acting as random number generators | Stan User’s Guide

ohh… yes not transformed parameter block, I meant transformed data block.

Thank you