Hi! I’m trying to fit the Latent Space Model (LSM) for an undirected weighted network.
This is the original paper, discussing the model for unweighted networks https://sites.stat.washington.edu/raftery/Research/PDF/hoff2002.pdf
The model, basically, is y_ij ~ Poisson(alpha - distance_ij)
i.e. each edge weight is modelled as a Poisson of parameter “some intercept minus the distance between the two extremities of the edge”. I use the squared distance because it’s faster.
The model is not identifiable for the positions in the latent space i.e. unvariant under rotation, translation, reflection. What I do (not included here) is do a Procrustes transform after on the positions. That part works fine.
In the generated quantities, I calculate distances between all positions. Those distances, however, should be identifiable.
My problem is: when the data is not too noisy, the distances are identifiable and that’s good. But when the data is noisy, there are problems in identifying even the distances.
Below is the stan model. Below the stan model are two examples with fake data. In the first example, the data is not too noisy. The distances (generated quantities) converge. In the second example, when the data is noisy, the chains do not converge.
In a third and final example, at the end, I use my real data and nothing converges.
data {
int<lower=0> N; // n edges
int<lower=0> M; // n nodes
int<lower=0> D; // D=2
array[N] int<lower=0> y; // edge weight
array[N] int<lower=1> from; // node i
array[N] int<lower=1> to; // node j
}
parameters {
real alpha; // intercept
matrix[M, D] K; // positions
}
transformed parameters {
array[N] real mu;
for (n in 1:N){
mu[n] = exp(alpha-squared_distance(K[from[n],],K[to[n],]));
}
}
model {
// priors
alpha ~ normal(1,3);
to_vector(K) ~ normal(0,3);
// likelihood
for (n in 1:N){
y[n] ~ poisson(mu[n]);
}
}
generated quantities{
vector[N] pred;
vector[N] log_lik;
vector[N] distances;
for (n in 1:N){
pred[n] = poisson_rng(mu[n]);
log_lik[n] = poisson_lpmf(y[n] | mu[n]);
distances[n] = squared_distance(K[from[n],],K[to[n],]);
}
}
Making fake data.
library(tidyverse)
library(rstan)
n <- 20
set.seed(222)
x <- rnorm(n,0,1)
y <- rnorm(n,0,1)
fake_df <- tibble(x=x,y=y,id=1:n)
mymat <- matrix(rep(0,n*n),ncol=n)
rownames(mymat) <- 1:n
colnames(mymat) <- 1:n
diag(mymat) <- NA
mymat[upper.tri(mymat)] <- NA
mymat <- as.data.frame(mymat) %>%
mutate(from=row.names(.)) %>%
gather(to,value,-from) %>%
filter(from!=to) %>% filter(complete.cases(.)) %>%
select(-value)
euclidean <- function(a, b) sqrt(sum((a - b)^2))
mymat$edist <- NA
for (i in 1:nrow(mymat)){
edist <- euclidean(unlist(fake_df[as.numeric(mymat$from[i]),1:2]),
unlist(fake_df[as.numeric(mymat$to[i]) ,1:2]))
mymat$edist[i] <- edist
mymat$valuenottoonoisy[i] <- rpois(1,exp(3-edist+rnorm(1,0,0.1)))
mymat$valuenoisy[i] <- rpois(1,exp(3-edist+rnorm(1,0,1.25)))
}
words <- as.character(1:n)
data <- mymat
data$from_n <- as.numeric(factor(data$from,words))
data$to_n <- as.numeric(factor(data$to,words))
plot(data$edist,data$valuenottoonoisy)
plot(data$edist,data$valuenoisy)
stan_data <- list(N=nrow(data),
M=length(unique(c(data$from,data$to))),
D=2,
y=data$valuenottoonoisy,
from=data$from_n,
to=data$to_n)
Fitting the data to the not too noisy data.
model <- stan_model(file = "model.stan")
fit <- sampling(model,data=stan_data, chains = 4, cores = 4, iter = 2000,
control=list(max_treedepth=15))
as.data.frame(summary(fit)$summary) %>%
filter(grepl("distance",row.names(.))) %>% pull(Rhat) %>%
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9991 0.9997 0.9999 1.0000 1.0003 1.0024
That’s good.
stan_data$y <- data$valuenoisy
fit <- sampling(model,data=stan_data, chains = 4, cores = 4, iter = 2000,
control=list(max_treedepth=15))
as.data.frame(summary(fit)$summary) %>%
filter(grepl("distance",row.names(.))) %>% pull(Rhat) %>%
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9998 1.0604 1.2471 1.5819 1.7158 8.9999
That’s very bad.
stan_data <- read_rds("https://github.com/justinsavoie/personal_website/raw/main/real_data.rds")
fit <- sampling(model,data=stan_data, chains = 4, cores = 4, iter = 2000,
control=list(max_treedepth=15))
as.data.frame(summary(fit)$summary) %>%
filter(grepl("distance",row.names(.))) %>% pull(Rhat) %>%
summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9992 1.0334 1.2037 1.7067 1.8082 17.6371
That’s very bad, also.
I wondered if anyone has advice concerning such a latent space model where the problem isn’t invariance, but rather, the fact that some the data generating process is very noisy. Priors? Different specification of the latent space? What do people generally do to fit a latent space to not very collaborative data? Thanks in advance!