Conventions vary from field to field. The commonality is a prior density of the form \text{gamma}(y; \epsilon, \epsilon) or even \text{inv-gamma}(y; \epsilon, \epsilon) so that the density concentrates against y = 0 instead of suppressing it. The conventional value of \epsilon changes across time and disciplines.
Yes, while I had the Jacobian correct in my haste to get an answer out I made the cardinal sin of naive multiplication on the linear scale, dgamma(y, epsilon, epsilon) * exp(y) instead of canceling everything on the log scale to avoid floating point issues. Implementing the log density properly gives a result that agrees with yours,
Indeed the problem isn’t the location of the typical set relative to the initialization but rather the asymmetry in the log density function. If warmup spends too much time on the left then the adaptation will end up in too aggressive of a step size configuration which can lead to occasionally divergences when exploring the sharp drop off on the right. The stochastic nature of the adaptation can also lead to some fits exhibiting divergences and some fits not exhibiting divergences.
Increasing adapt_delta forces a less aggressive step size adaptation which should help, as does changing the prior model to soften that drop off in the log density function.
