I’m trying to fit a lognormal model (code below). This works just fine if I have a vector y. E.g. in R,
y <- exp(rnorm(1000))
stan.data <- list(y = y, n = length(y), alpha = 1, beta = 1)
stan.fit <- sampling(stanmodels::lognormal, data = stan.data)
where stanmodels::lognormal is defined as:
data {
real<lower=0> alpha;
real<lower=0> beta;
int<lower=0> n;
real<lower=0> y[n];
}
parameters {
real mu;
real<lower=0> sigmasq;
}
transformed parameters {
real<lower=0> sigma;
sigma = sqrt(sigmasq);
}
model {
sigmasq ~ inv_gamma(alpha, beta);
mu ~ normal(0, 1000);
y ~ lognormal(mu, sigma);
}
However, in my actual problem, I don’t have y. Instead I have mean(y), sd(y), and length(y). Is it possible to still use Stan to achieve the same fit as if I had y? I’m a novice to Stan, and I’ve looked through many examples and the official documentation, but can’t seem to find anything that mentions this problem.
Thanks in advance!
–sundar
It is basically a missing data problem with constraints given by the sample mean and the sample standard deviation.
Thanks for the reply! Should I simulate the data using these constraints? Then use my Stan code on the simulated data? Or is there a way to modify the Stan code to do this for me?
I would declare the values in the parameters block to be a simplex of size n
. Then use your sample mean and sample standard deviation to transform the simplex into something that can be described by a lognormal density. Make sure to adjust by the log of the absolute value of the determinant of the Jacobian of the transformation though.
If you can get the mean and sd of log y
rather than of y
, this would be easy. Otherwise, the non-linearity makes the constraint in the missing data problem tricky to code whether you try to impute log y
or y
from sd(y)
and mean(y)
and n
, where y > 0
.
Also, those wide priors are not helping matters.
Actually, for my actual problem, I have a choice of either mean(y) or mean(log(y)). Same for sd(log(y)). The value of y is guaranteed to be >0. This means I can use:
y <- exp(rnorm(1000))
stan.data <- list(logmean = mean(log(y)), logsd = sd(log(y)), n = length(n))
But I’m still unsure how to set up the model statement. I’m using Stan for the first time and still don’t have a full grasp of the language. Thank you for all your replies thus far!
–sundar
Having mean(log(y))
and sd(log(y))
is more useful than having mean(y)
and sd(y)
, but I have a hard time imagining how the former is possible and yet the underlying data are unavailable. In that case, the likelihood can be written as
which when you take the logarithm and write it in Stan comes out as
target += -0.5 * n * log(2 * pi() * sigma_squared)
- 0.5 * (n - 1) / sigma_squared * s_squared
- 0.5 * n / sigma_squared * square(theta - xbar);