How to model mixed variables (Gaussian, ranking, binary, and categorical) together in Stan to generate a joint distribution of them?

I am writing to seek guidance and advice regarding a modeling project I am currently working on. I must admit that I am not a statistician or a programmer by profession, but I find myself facing significant challenges as I approach the deadline for my project.

In an effort to address these challenges, I have explored various modeling tools, ultimately deciding that a Bayesian belief network is the most suitable approach for my project. After careful investigation, I have determined that Stan is the most appropriate tool to meet my project requirements. I have previously experimented with Bayesian server software and other tools such as PyMC3 and pyBBN. However, I found that these alternatives had limitations that did not align with the specific needs of my application, leading me to choose Stan.

While I have made progress in using Stan, I must admit that I have encountered some difficulties along the way. One of the challenges I have faced is with the documentation. From my perspective as a non-expert, it appears that the documentation is primarily geared toward Stan developers and high-level statisticians. It often lacks the level of detail and explanation that would be most helpful to someone in my position. Nevertheless, I have managed to navigate through some of these challenges.

One concept that continues to elude my full understanding is that of “marginalization.” I am unsure why it is necessary to marginalize and transform parameters, especially when dealing with discrete variables or discrete distributions. Additionally, my research has revealed that there is a scarcity of practical examples and resources that delve into Bayesian belief networks beyond the well-known “Berkeley Alarm” example. This leaves me with many unanswered questions and uncertainties.

Now, if I may, let me provide an overview of my current project. I am tasked with modeling a real-world problem that involves 47 input variables falling into four categories: continuous data from measuring devices, ranking questions, yes-no questions, and categorical questions, each with three or more categories. Additionally, there are 10 latent variables for which I have little information about their distribution. Finally, I must predict the outcomes of seven output variables.

To address this complex problem, I have chosen to give the latent variables and the output variables Gaussian distributions with location = 50 and scale = 20. As this description suggests, each latent variable receives input from a different type of input variable distribution.
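In Stan terms, I currently express this roughly as follows. This is only a minimal sketch with made-up variable names (one latent variable, one continuous output), not my full model:

```stan
// Minimal sketch: one hypothetical latent variable and one continuous
// output, both tied to the normal(50, 20) choice described above.
data {
  int<lower=0> N;            // number of observations
  vector[N] y;               // one continuous output variable
}
parameters {
  real latent;               // one latent variable
  real<lower=0> sigma;       // residual scale for the output
}
model {
  latent ~ normal(50, 20);   // prior on the latent variable
  sigma ~ exponential(1);    // weakly informative prior on the scale
  y ~ normal(latent, sigma); // output centered on the latent variable
}
```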

During my research, I was able to find approaches for modeling mixed Gaussian inputs and generating joint distributions from Gaussian inputs. I also discovered a solution for modeling a joint probability distribution between continuous and binary inputs, as outlined in an article by cwolf. However, I remain uncertain about how to effectively model the joint probability distribution of continuous and categorical variables, and how to combine continuous, ranking, categorical, and binary distributions into a single joint distribution.
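One formulation I have seen for tying a continuous and a binary variable together (I am not certain it matches the approach in that article) is a shared latent Gaussian variable, with the binary variable observed through a Bernoulli-logit likelihood. A rough sketch with made-up names and priors:

```stan
// Sketch of a shared-latent-variable model linking one continuous and
// one binary observed variable. Names and priors are illustrative only.
data {
  int<lower=0> N;
  vector[N] x_cont;                      // continuous measurement
  array[N] int<lower=0, upper=1> x_bin;  // yes/no answer
}
parameters {
  vector[N] eta;              // shared latent variable, one per subject
  real mu;                    // intercept for the continuous variable
  real<lower=0> gamma;        // loading of the continuous variable on eta
                              // (sign-constrained for identifiability)
  real<lower=0> sigma;        // residual scale of the continuous variable
  real alpha;                 // intercept for the binary variable
  real beta;                  // loading of the binary variable on eta
}
model {
  eta ~ std_normal();         // standardized latent variable
  mu ~ normal(0, 5);
  gamma ~ normal(0, 2);
  sigma ~ exponential(1);
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  x_cont ~ normal(mu + gamma * eta, sigma);
  x_bin ~ bernoulli_logit(alpha + beta * eta);
}
```

The dependence between the two observed variables then comes entirely through the shared eta.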

I have come across some functions known as “copula functions” that seem to offer a partial solution to my challenges. However, I have not yet found a comprehensive solution for modeling the joint distribution of these diverse variable types.
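From what I have gathered so far, a bivariate Gaussian copula can at least be written as a user-defined function in Stan. The sketch below is only the copula log density on the uniform scale (my own attempt, so it may not be the right building block for my problem); to use it, each marginal would first be mapped to (0, 1) through its own CDF, with the marginal log densities added separately:

```stan
functions {
  // Log density of a bivariate Gaussian copula at u in (0,1)^2 with
  // correlation rho: the ratio of the bivariate normal density to the
  // product of its standard normal marginals, evaluated at inv_Phi(u).
  real gaussian_copula_lpdf(vector u, real rho) {
    real z1 = inv_Phi(u[1]);
    real z2 = inv_Phi(u[2]);
    return -0.5 * log1m(square(rho))
           + (2 * rho * z1 * z2 - square(rho) * (square(z1) + square(z2)))
             / (2 * (1 - square(rho)));
  }
}
```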

I appreciate your time and any insights or guidance you can provide to help me overcome these challenges and advance my project. Your expertise and assistance would be immensely valuable as I work towards a successful outcome.

Thank you for your attention and support.

Hi, @Meqdad_Hasan: Our target audience for the Reference Manual is programmers and our target audience for the User’s Guide and Functions Reference is applied Bayesian statisticians. None of them are tutorial in nature by design.

We have a lot of case studies and StanCon talks that are more tutorial in nature:

Many other people have written introductions, often aimed at particular subfields.

Do you mean priors?

Each output variable can be modeled with a generalized linear model. If there are multiple linear regressions, you can model them with a “seemingly unrelated regression” if their errors are correlated. Copulas are a little different: they let you model the marginals separately and then combine them with a covariance structure. Overall, this is a pretty complicated kind of model for a first effort!
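To make the GLM suggestion concrete, here is a sketch with one likelihood per outcome type, all sharing the same predictor matrix: normal for continuous outcomes, Bernoulli-logit for yes/no, ordered logistic for rankings, and categorical logit for multi-category outcomes. Names, priors, and the shared-predictor setup are illustrative, not a complete recipe for the full 47-variable problem:

```stan
// Sketch: one generalized linear model per outcome type, all sharing
// the same predictor matrix x. Names and priors are illustrative.
data {
  int<lower=0> N;
  int<lower=0> K;                         // number of predictors
  matrix[N, K] x;                         // shared predictors
  vector[N] y_cont;                       // continuous outcome
  array[N] int<lower=0, upper=1> y_bin;   // yes/no outcome
  int<lower=2> R;                         // number of ranking levels
  array[N] int<lower=1, upper=R> y_rank;  // ranking (ordinal) outcome
  int<lower=2> C;                         // number of categories
  array[N] int<lower=1, upper=C> y_cat;   // categorical outcome
}
parameters {
  real a_cont;
  vector[K] b_cont;
  real<lower=0> sigma;
  real a_bin;
  vector[K] b_bin;
  vector[K] b_rank;
  ordered[R - 1] cutpoints;               // thresholds for the ordinal outcome
  matrix[K, C] b_cat;                     // only identified up to a shift;
                                          // the priors soften that
}
model {
  // weakly informative priors (illustrative)
  a_cont ~ normal(0, 5);
  b_cont ~ normal(0, 2);
  sigma ~ exponential(1);
  a_bin ~ normal(0, 2);
  b_bin ~ normal(0, 2);
  b_rank ~ normal(0, 2);
  to_vector(b_cat) ~ normal(0, 2);

  y_cont ~ normal(a_cont + x * b_cont, sigma);
  y_bin ~ bernoulli_logit(a_bin + x * b_bin);
  y_rank ~ ordered_logistic(x * b_rank, cutpoints);
  for (n in 1:N)
    y_cat[n] ~ categorical_logit((x[n] * b_cat)');
}
```

If the errors of several continuous outcomes need to be correlated, the normal likelihoods can be replaced with a multivariate normal, which is the seemingly-unrelated-regression case mentioned above.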

We don’t implement discrete parameter sampling in Stan. Discrete parameter sampling tends to be very inefficient.
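This is where marginalization comes in: because HMC needs continuous parameters, any discrete latent variable has to be summed out of the likelihood rather than sampled. A minimal sketch, using a two-component normal mixture where the discrete component indicator is marginalized with log_mix:

```stan
// Two-component normal mixture with the discrete component indicator
// marginalized (summed) out rather than sampled. Illustrative only.
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;  // mixing proportion
  ordered[2] mu;                 // component means, ordered to avoid label switching
  vector<lower=0>[2] sigma;      // component scales
}
model {
  theta ~ beta(2, 2);
  mu ~ normal(0, 10);
  sigma ~ exponential(1);
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}
```

The User’s Guide chapter on latent discrete parameters covers this pattern in more detail.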

There are a ton of introductions to Bayesian stats online. There are whole Coursera courses, for example. Richard McElreath and Ben Goodrich and Mike Lawrence all have full-semester introductions to Bayesian stats using Stan on YouTube.

I’m curious what the decision was based on if you wouldn’t mind sharing.

Yes, as a prior.

Thank you. I have already started studying them, and I am getting some good results with the code and the model.

Actually, the main reason is Stan’s efficient implementation of Hamiltonian Monte Carlo; in the other two packages, HMC is not directly implemented. I also wanted to work efficiently with continuous data, and most of the examples for both Python packages I studied used discrete data, although there were some separate implementations for continuous data on GitHub using pyBBN. In addition, I found it more comfortable to work with a well-known package or language that has a larger community. As for the other two programs: Bayesian Server is not free, and its trial version doesn’t save your work, which stopped me from using it, since my time is limited and I want to save my small piece of work at each step for the next session. The other one is BayesFusion, which has a very useful academic version; however, that version does not support linking discrete parent nodes (such as categorical) to a continuous child node, so I stopped working with it.