I am writing to seek guidance and advice regarding a modeling project I am currently working on. I must admit that I am not a statistician or a programmer by profession, but I find myself facing significant challenges as I approach the deadline for my project.
In an effort to address these challenges, I have explored various modeling tools, ultimately deciding that a Bayesian belief network is the most suitable approach for my project. After careful investigation, I have determined that Stan is the most appropriate tool to meet my project requirements. I have previously experimented with Bayesian server software and other tools such as PyMC3 and pyBBN. However, I found that these alternatives had limitations that did not align with the specific needs of my application, leading me to choose Stan.
While I have made progress in using Stan, I must admit that I have encountered some difficulties along the way. One of the challenges I have faced is with the documentation. From my perspective as a non-expert, it appears that the documentation is primarily geared toward Stan developers and high-level statisticians. It often lacks the level of detail and explanation that would be most helpful to someone in my position. Nevertheless, I have managed to navigate through some of these challenges.
One concept that continues to elude my full understanding is that of “marginalization.” I am unsure why it is necessary to marginalize and transform parameters, especially when dealing with discrete variables or discrete distributions. Additionally, my research has revealed that there is a scarcity of practical examples and resources that delve into Bayesian belief networks beyond the well-known “Berkeley Alarm” example. This leaves me with many unanswered questions and uncertainties.
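To make the question concrete, here is a minimal sketch of what I currently understand marginalization to mean. As far as I can tell, Stan's sampler relies on gradients and therefore cannot sample discrete parameters directly, so a discrete latent state has to be summed out of the likelihood on the log scale (Stan provides `log_sum_exp` for this). The sketch below is plain Python rather than Stan, and the two-state setup, the function names, and all numbers are hypothetical illustrations, not part of my actual model.

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginal_log_lik(y, theta, mus, sigma):
    """log p(y) with the discrete state z summed out.

    p(y) = sum_k theta[k] * Normal(y | mus[k], sigma)
    """
    terms = [math.log(t) + normal_lpdf(y, mu, sigma)
             for t, mu in zip(theta, mus)]
    return log_sum_exp(terms)

# Hypothetical numbers: a yes/no latent state with P(z = 1) = 0.3,
# and an observation whose mean depends on that state.
theta = [0.7, 0.3]
mus = [40.0, 60.0]
sigma = 20.0
print(marginal_log_lik(55.0, theta, mus, sigma))
```

If this reading is right, the discrete state never appears as a parameter at all; only its probabilities `theta` do, and the state itself can be recovered afterwards from the per-state terms.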
Now, if I may, let me provide an overview of my current project. I am tasked with modeling a real-world problem that involves 47 input variables falling into four categories: continuous data from measuring devices, ranking questions, yes-no questions, and categorical questions, each with three or more categories. Additionally, there are 10 latent variables for which I have little information about their distribution. Finally, I must predict the outcomes of seven output variables.
To address this complex problem, I have chosen to model the latent variables and output variables as Gaussian distributions with location = 50 and scale = 20. As this description suggests, each latent variable receives input from a different type of input variable distribution.
During my research, I was able to find approaches for modeling mixed Gaussian inputs and generating joint distributions from Gaussian inputs. I also discovered a solution for modeling joint probability distributions between continuous and binary inputs, as outlined in an article by cwolf. However, I remain uncertain about how to effectively model the joint probability distribution of continuous and categorical variables, and how to incorporate continuous, ranking (ordinal), categorical, and binary distributions into a single joint distribution.
I have come across a class of functions known as “copula functions” (copulas) that seem to offer a partial solution to my challenges. However, I have not yet found a comprehensive solution to address the complexity of modeling the joint distribution of these diverse variable types.
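For reference, this is roughly how I understand the copula idea, written as a small Python simulation rather than a Stan model. It is a sketch under my own assumptions: correlated standard normal variables are pushed through the normal CDF to get uniforms, and each uniform is then mapped to a different kind of variable. The function name `sample_mixed`, the correlation `rho`, the probability `p_yes`, and the cut points are all hypothetical; only the Normal(50, 20) margin comes from my description above. Unordered categorical variables are not covered, since they have no natural ordering to threshold.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def sample_mixed(rho, p_yes, rank_cuts, rng):
    """Draw one (continuous, binary, rank) triple from a Gaussian copula.

    z2 and z3 each have correlation rho with z1 (and rho**2 with each other).
    """
    z1 = rng.gauss(0.0, 1.0)
    scale = math.sqrt(1.0 - rho * rho)
    z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)
    z3 = rho * z1 + scale * rng.gauss(0.0, 1.0)

    # Map each latent normal to a uniform on (0, 1); clamp to keep inv_cdf valid.
    eps = 1e-12
    u1, u2, u3 = (min(max(STD_NORMAL.cdf(z), eps), 1.0 - eps)
                  for z in (z1, z2, z3))

    continuous = NormalDist(50.0, 20.0).inv_cdf(u1)  # continuous margin
    binary = 1 if u2 > 1.0 - p_yes else 0            # yes/no with P(yes) = p_yes
    rank = sum(u3 > c for c in rank_cuts)            # ordinal level 0..len(rank_cuts)
    return continuous, binary, rank

# Hypothetical settings: strong positive dependence, 30% "yes", four rank levels.
rng = random.Random(1)
for _ in range(5):
    print(sample_mixed(0.8, 0.3, [0.25, 0.5, 0.75], rng))
```

What I cannot yet see is how to turn this simulation idea into a likelihood that Stan can fit, particularly for the discrete margins.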
I appreciate your time and any insights or guidance you can provide to help me overcome these challenges and advance my project. Your expertise and assistance would be immensely valuable as I work towards a successful outcome.
Thank you for your attention and support.