I have a single chain sampling with 1000 warmup and 1500 sampling iterations that I run with cmdstanpy. Running the diagnose command on the output shows that the chain performs really badly (many divergences, very low BFMI).
However, my posterior predictive checks look very good and I am able to recover the original data.
What am I missing? How is this even possible?

It would help if you could provide more details, such as R-hat values or the procedure/plots used for the PPC. In general, I think it is better to establish convergence before talking about anything else.

Thanks! R-hats are all bad, but since I have only one chain I am not sure how indicative they are. What do you mean by the procedure used for the PPC?
I use the `_rng` functions in the generated quantities block, as in the intertemporal choice task.

Posterior predictive checking is good for detecting when a model provides a poor fit to the data, but it is not a replacement for MCMC diagnostics, which don't care about the quality of the fit to the data and instead have to do with whether the MCMC chains have converged to the correct posterior distribution (i.e., the correct distribution given the model, which may or may not be misspecified). Suppose the true posterior distribution for a one-parameter model is normal(0, 3), and suppose your MCMC fitting fails and yields an incorrect posterior of normal(1, 0.1). This is terrible inference that must not be trusted, but posterior predictive checking would look fine (as long as the model isn't badly misspecified)! We know this because normal(1, 0.1) is entirely consistent with the correct posterior and therefore must be capable of generating data that look comparable to the observed data.
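To make this concrete, here is a small numpy simulation (hypothetical numbers, not your model) showing that an overconfident, wrong posterior can still pass a simple mean-based PPC:

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed data: generated with theta = 1 (a value well within the
# true posterior normal(0, 3)), observation noise sd = 1.
y_obs = rng.normal(1.0, 1.0, size=500)

# An overconfident, *wrong* posterior: normal(1, 0.1) instead of
# the correct normal(0, 3).
theta_draws = rng.normal(1.0, 0.1, size=4000)

# Posterior predictive replicates: one simulated dataset per draw,
# mirroring what a normal_rng call in generated quantities would do.
y_rep = rng.normal(theta_draws[:, None], 1.0, size=(4000, 500))

# A simple PPC: where does the observed mean fall among replicated means?
rep_means = y_rep.mean(axis=1)
p_val = (rep_means > y_obs.mean()).mean()
print(round(p_val, 2))  # mid-range, so this simple PPC "passes"
```

The PPC looks fine here even though the posterior is badly wrong, because the narrow posterior sits inside the high-density region of the correct one.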

Therefore, it is crucial to first get a model that passes the MCMC diagnostics (R-hat, effective sample size, divergences, BFMI). And then it's time to do posterior predictive checking to see whether the model that you've fit is consistent with (and therefore hopefully a reasonable approximation to) the true data-generating process.
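For intuition about what R-hat measures (and why it needs multiple chains or splits), here is a minimal numpy sketch of the classic split R-hat; note that Stan's current default is the more robust rank-normalized variant:

```python
import numpy as np

def split_rhat(chains):
    """Classic split R-hat; chains has shape (n_chains, n_iter)."""
    n_chains, n_iter = chains.shape
    half = n_iter // 2
    # Split each chain in half so within-chain drift is also detected.
    splits = chains[:, : 2 * half].reshape(n_chains * 2, half)
    w = splits.var(axis=1, ddof=1).mean()        # within-split variance
    b = half * splits.mean(axis=1).var(ddof=1)   # between-split variance
    var_plus = (half - 1) / half * w + b / half  # pooled variance estimate
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 1000))              # four well-mixed chains
bad = good + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain stuck elsewhere
print(split_rhat(good) < 1.01, split_rhat(bad) > 1.1)  # → True True
```

With one chain, only the two halves of that single chain can disagree, which is why single-chain R-hat misses a chain that is stably stuck in the wrong place.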

Note that PPCs aren't always very sensitive even for what they are designed to do (i.e., diagnosing model misspecification). Posterior predictive checking is a bit of a fishing expedition to search for bad fit. Different PPCs will be sensitive to different forms of misspecification, and no PPC is sensitive to every form of misspecification. And even well-chosen PPCs (tailored by a careful researcher to detect particular relevant forms of misspecification) might be underpowered if the dataset is small.

Thank you so much!! I know it is a different topic, but where do I start to understand what is wrong with my model? I have ~25k parameters and a few million data points...
Also, does it even make sense to run only 1 chain? I know that R-hat estimates are not reliable with 1 chain, but what if all the other diagnostics are OK?

Running a single chain is heavily discouraged precisely because it makes R-hat much less sensitive. Debugging big models is hard, and it is usually sensible to start with a smaller model and only a subset of the data - hard to say anything more specific without seeing your model and/or your data. Some additional pointers can be found in the "Bayesian workflow" preprint by Gelman et al. and at Divergent transitions - a primer.

Because chains can be run in parallel, you should check with 3 or 4 chains. That way the trace plots give you more information than a single chain would.

Thank you all! My model was sampling fine (with 1 chain) before I added the QR decomposition, so I am trying to figure out what's up there.
I realize that one chain is suboptimal and that all Stan tutorials use 4 chains. Would it make sense to run 2 chains, or is this heavily discouraged as well?
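For reference, the scaled QR reparameterization described in the Stan user's guide can be sketched in numpy to sanity-check the scaling convention (hypothetical shapes, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
x = rng.normal(size=(n, k))

# Thin QR decomposition, scaled as in the Stan user's guide so that
# the columns of Q* have roughly unit scale: x = Q* R*.
q, r = np.linalg.qr(x)
q_star = q * np.sqrt(n - 1.0)
r_star = r / np.sqrt(n - 1.0)

# If theta is sampled on the Q* scale, the original coefficients are
# recovered as beta = R*^{-1} theta, and x @ beta equals Q* @ theta.
theta = rng.normal(size=k)
beta = np.linalg.solve(r_star, theta)
print(np.allclose(x @ beta, q_star @ theta))  # → True
```

A mismatch in the sqrt(n - 1) scaling, or forgetting to recover beta from theta, is a common source of trouble when adding this reparameterization.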

The more the better, so 2 is better than 1. :-) Eventually it comes down to the tradeoff you are willing to make between the compute resources consumed and the risk of getting wrong/nonsensical inferences...

Still: a) it is quite likely there is a problem with the model, so running a simpler model on a subset of the data first is definitely a good step (because it lets you iterate much faster). Issues with the model also frequently mean the model samples much more slowly (a.k.a. the "folk theorem"). And b) you might want to consider using an approximation: INLA is quite a good package for approximate multilevel models, if your model fits that box. But point a) still holds even there - in particular, testing that the smaller model/data work with Stan, checking that the results on the smaller model/data are the same between Stan and INLA, and then running INLA on the big model is IMHO a very robust workflow.

I think this is also why it takes such a long time to compute. If you have to run on the full dataset, I would still suggest checking the Stan program with a subset first - maybe you will already see something with that subset. To check for problems with the code and convergence (but not final estimates), in my experience I often check with 100, 500, 1000, or 2000 iterations. After that, I often use 5000 iterations for the final estimates.
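As a sketch of the subsetting idea (hypothetical shapes and variable names; the commented-out `sample` call assumes a compiled cmdstanpy model object):

```python
import numpy as np

rng = np.random.default_rng(0)

# Full data (hypothetical stand-ins for the real millions of points).
N = 2_000_000
y = rng.normal(size=N)
x = rng.normal(size=N)

# Take a small random subset for fast iteration on the model code.
idx = rng.choice(N, size=5000, replace=False)
data_subset = {"N": len(idx), "y": y[idx], "x": x[idx]}

# With cmdstanpy, the quick convergence check might then look like
# (not run here; `model` is a compiled CmdStanModel):
# fit = model.sample(data=data_subset, chains=4,
#                    iter_warmup=500, iter_sampling=500)
print(data_subset["N"])  # → 5000
```

Iterating on a subset like this makes each debug cycle minutes instead of hours, and most coding and convergence problems show up already at this scale.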