Distance matrix regression


#1

I am curious how one might set up a way to use matrices (of the same dimension) as response and predictor variables in brms.

I’ve attempted to do this by converting each matrix into a vector, but using row and column position in the original matrix as random effects. This works well for non-symmetric matrices, but I’m not sure how one would handle syntax when the matrix is symmetric, like a distance matrix.
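A minimal sketch of the vectorisation described above, using made-up 3 x 3 matrices. The names Y, X, long and ColID are illustrative only, and the brm() call is left commented out because it is just one plausible way to write such a model.

```r
# Toy 3 x 3 matrices standing in for a response and a predictor (made-up values)
set.seed(1)
Y <- matrix(rnorm(9), nrow = 3)
X <- matrix(rnorm(9), nrow = 3)

# as.vector() unrolls a matrix column by column; row() and col() give the
# matching row and column index of every cell, unrolled in the same order
long <- data.frame(
  y     = as.vector(Y),
  x     = as.vector(X),
  RowID = factor(as.vector(row(Y))),
  ColID = factor(as.vector(col(Y)))
)
head(long)

# A model of the kind described might then look like this (assumption):
# fit <- brms::brm(y ~ x + (1 | RowID) + (1 | ColID), data = long)
```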


#2

On a technical level, you can use matrices as elements of a data.frame and then pass to brms:

library(brms)
df <- data.frame(y = rnorm(100))
df$A <- matrix(rnorm(300), ncol = 3)  # matrix column: 100 rows x 3 predictors
fit <- brm(y ~ A, data = df)

but I get the feeling this is not what you have in mind. In that case, could you clarify your question?


#3

Yes, the issue is not the technical level of inputting a matrix; it is ensuring that the structure of the distance matrix is accounted for in the model.

Here is a toy example with the iris dataset:

library(brms)
data(iris)
#create distance matrix from first two principal components
iris.dist<-as.matrix(dist(prcomp(iris[,1:2],scale.=TRUE)$x[,1:2]))
#scale to mean 0, SD 1
iris.dist<-iris.dist-mean(iris.dist)
iris.dist<-iris.dist/sd(iris.dist)

#empty data frame with one column per variable (four, to match the names below)
iris.dat <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(iris.dat) <- c("Distance","RowID", "ColumnID", "SameSpecies")
for(i in 1:nrow(iris)) {
  for(j in 1:nrow(iris)) {
    if(i!=j) { #do not include distance if same individual is chosen
      iris.dat<-rbind(iris.dat,data.frame(
        Distance=iris.dist[i,j], # distance
        RowID=as.character(i), # row ID
        ColumnID=as.character(j), # column ID
        SameSpecies=as.numeric(iris$Species[i]==iris$Species[j])
        #are the two individuals of the same species?
      ))
    }
  }
}

# do individuals of the same species have a smaller Euclidean distance?
brm.iris<-brm(Distance~SameSpecies+(1|RowID)+(1|ColumnID),
                    family=gaussian,
                    data=iris.dat,cores=4, inits = 0,
                    prior=c(prior("normal(0,2)", "b")))
summary(brm.iris)

The problem here is that each distance appears twice (i.e. as 2,5 and as 5,2), but without this repetition the RowID and ColumnID variables do not fully capture the dependencies between observations.
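To make the duplication concrete, here is a small sketch with four made-up points instead of the 150 iris rows. The names pts, D, full and pair are illustrative only.

```r
set.seed(42)
pts <- matrix(rnorm(8), ncol = 2)   # 4 made-up points in 2-D
D   <- as.matrix(dist(pts))

# All off-diagonal cells, i.e. what the nested loop with i != j keeps
idx  <- which(row(D) != col(D), arr.ind = TRUE)
full <- data.frame(RowID    = idx[, "row"],
                   ColumnID = idx[, "col"],
                   Distance = D[idx])

# Label each cell by its unordered pair, so that (2, 4) and (4, 2) share a key
full$pair <- paste(pmin(full$RowID, full$ColumnID),
                   pmax(full$RowID, full$ColumnID), sep = "-")

nrow(full)                 # n * (n - 1) rows
length(unique(full$pair))  # n * (n - 1) / 2 distinct distances
table(full$pair)           # every unordered pair is counted twice
```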


#4

A distance matrix is also going to

  • be symmetric,
  • have non-negative entries,
  • have zero diagonals, and
  • satisfy the triangle inequality, d(a, b) + d(b, c) ≥ d(a, c).

I’m not exactly sure what you mean by taking that structure into account. It looks like the distance is the observation in your regression, so presumably you’re looking for some structure in those random effects so that you get the same result when the row and column IDs are switched. Given the way you set up the regression as SameSpecies + (1 | RowID) + (1 | ColumnID), if you swap row and column IDs, you’ll get the same prediction, because SameSpecies will be the same.
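The four properties listed above can be checked numerically. This is only a sketch: is_distance_matrix and its tol argument are hypothetical names, and the check is run on plain Euclidean distances from the iris data.

```r
# Hypothetical helper: checks the four properties for a square matrix D
is_distance_matrix <- function(D, tol = 1e-8) {
  n <- nrow(D)
  symmetric <- isTRUE(all.equal(D, t(D), check.attributes = FALSE))
  nonneg    <- all(D >= 0)
  zero_diag <- all(abs(diag(D)) < tol)
  # Triangle inequality: d(a, c) <= d(a, b) + d(b, c) for every a, b, c.
  # outer(D[, b], D[b, ], "+")[a, c] equals d(a, b) + d(b, c).
  triangle <- TRUE
  for (b in seq_len(n)) {
    if (any(D > outer(D[, b], D[b, ], "+") + tol)) triangle <- FALSE
  }
  c(symmetric = symmetric, nonnegative = nonneg,
    zero_diagonal = zero_diag, triangle = triangle)
}

D <- as.matrix(dist(iris[, 1:2]))
is_distance_matrix(D)

# After centring and scaling, as done earlier in the thread, the entries are
# no longer all non-negative and the diagonal is no longer zero
Dz <- (D - mean(D)) / sd(D)
is_distance_matrix(Dz)
```

Symmetry survives the standardisation, which is the property the rest of the thread is concerned with.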


#5

I suggest using only half of the distance matrix (below the diagonal, say) and then using a multi-membership grouping term to account for the fact that rows and columns refer to the same set of locations:

Distance ~ SameSpecies + (1 | mm(RowID, ColumnID))

See also https://journal.r-project.org/archive/2018/RJ-2018-017/index.html for more details about multimembership terms.
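As a rough illustration of why a multi-membership term fits this problem, the sketch below compares the group-level contribution for a pair (i, j) under the two formulations. It assumes equal weights of 0.5 per member, which is my understanding of the mm() default when no weights are supplied. The effects u, u_row and u_col are simulated, not estimated, and all names are hypothetical.

```r
set.seed(123)
n <- 5
u     <- rnorm(n)  # one shared effect per individual (multi-membership)
u_row <- rnorm(n)  # separate row effects ...
u_col <- rnorm(n)  # ... and column effects, as in (1 | RowID) + (1 | ColumnID)

# Group-level contribution to the linear predictor for the pair (i, j)
mm_term  <- function(i, j) 0.5 * u[i] + 0.5 * u[j]   # assumed equal weights
sep_term <- function(i, j) u_row[i] + u_col[j]

# The shared-effect version is symmetric in i and j by construction
mm_term(2, 5) == mm_term(5, 2)

# The separate row/column version generally is not
sep_term(2, 5) == sep_term(5, 2)
```

The SameSpecies part of the predictor is unchanged by swapping i and j in either formulation; only the group-level part differs.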


#6

That multimembership thing is neat!

I don’t think we’re quite there yet, though. Won’t this formula still wind up counting Distance[i, j] and Distance[j, i] as two observations rather than one, and counting Distance[i, i] as an observation even though we know structurally the result has to be zero in that case?


#7

A multi-membership grouping term would indeed solve the problem! Perfect, thank you Paul!

This will let me account for the fact that for a hypothetical set of three points X, Y, and Z, the distance between point X and Y is not completely independent of the distance between X and Z or the distance between Y and Z.
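A quick simulation sketch of that dependence, assuming points drawn independently from a 2-D standard normal (an assumption made purely for illustration). The names rpoint, eu, cor_shared and cor_disjoint are hypothetical.

```r
set.seed(2024)
n_rep <- 5000

# n_rep independent replicates of four points X, Y, Z, W in 2-D
rpoint <- function() matrix(rnorm(2 * n_rep), ncol = 2)
X <- rpoint(); Y <- rpoint(); Z <- rpoint(); W <- rpoint()

# Row-wise Euclidean distance between two sets of points
eu <- function(A, B) sqrt(rowSums((A - B)^2))

# Distances that share the point X are positively correlated ...
cor_shared <- cor(eu(X, Y), eu(X, Z))
# ... whereas distances between disjoint pairs are not
cor_disjoint <- cor(eu(X, Y), eu(Z, W))

c(shared = cor_shared, disjoint = cor_disjoint)
```

The shared-point correlation is the kind of dependence a per-individual group-level effect is meant to absorb.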

By using the multi-membership grouping term, I think Stan/brms could be a potential alternative to the exponential random graph models common in social network analysis, in addition to performing multivariate distance matrix regression (MDMR).


#8

@Bob_Carpenter That’s why I am saying one needs to use only the lower triangular part of the matrix.

@Zacco Glad to hear multi-membership terms are helpful to you :-)


#9

I think I can avoid that by removing the duplicated terms. The code would now look like this:

library(brms)
data(iris)
# create distance matrix from first two principal components
iris.dist <- as.matrix(dist(prcomp(iris[, 1:2], scale. = TRUE)$x[, 1:2]))
# scale to mean 0, SD 1
iris.dist <- iris.dist - mean(iris.dist)
iris.dist <- iris.dist / sd(iris.dist)

# empty data frame with one column per variable (four, to match the names below)
iris.dat <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(iris.dat) <- c("Distance", "RowID", "ColumnID", "SameSpecies")
for (i in 1:nrow(iris)) {
  for (j in 1:nrow(iris)) {
    if (i < j) { # keep each unordered pair once (one triangle of the symmetric matrix)
      iris.dat <- rbind(iris.dat, data.frame(
        Distance = iris.dist[i, j], # distance
        RowID = as.character(i), # row ID
        ColumnID = as.character(j), # column ID
        SameSpecies = as.numeric(iris$Species[i] == iris$Species[j])
        # are the two individuals of the same species?
      ))
    }
  }
}

# do individuals of the same species have a smaller Euclidean distance?
brm.iris <- brm(Distance ~ SameSpecies + (1 | mm(RowID, ColumnID)),
                family = gaussian,
                data = iris.dat, cores = 4, inits = 0,
                prior = c(prior("normal(0,2)", "b")))
summary(brm.iris)
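The nested loop above grows the data frame one row at a time, which can be slow for 150 individuals. A possible loop-free sketch of the same data preparation uses lower.tri() to pick out one triangle of the matrix. The names pcs and idx are illustrative. Here RowID is the larger index of each pair rather than the smaller, which should not matter for a symmetric multi-membership term.

```r
data(iris)
pcs <- prcomp(iris[, 1:2], scale. = TRUE)$x[, 1:2]
iris.dist <- as.matrix(dist(pcs))
iris.dist <- (iris.dist - mean(iris.dist)) / sd(iris.dist)

# (row, col) index pairs of all cells strictly below the diagonal
idx <- which(lower.tri(iris.dist), arr.ind = TRUE)

iris.dat <- data.frame(
  Distance    = iris.dist[idx],
  RowID       = as.character(idx[, "row"]),
  ColumnID    = as.character(idx[, "col"]),
  SameSpecies = as.numeric(iris$Species[idx[, "row"]] ==
                           iris$Species[idx[, "col"]])
)

nrow(iris.dat)  # 150 * 149 / 2 = 11175 unordered pairs
```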


#10

Thanks for being patient. I missed that. @Zacco’s example cleared up what you meant.

@Zacco, feel free to mark your last post or @paul.buerkner’s as a solution.