Categorical Regression

thowagner · June 15, 2021, 9:54am

Hi,

I’m facing the following problem. I’ve have two categorical variables. Both can have the same categories. I want to measure the covariance between them. My idea was to use categorical regression to do that, with one categorical variable predicting the other.

I’m right now building the model (and I will certainly come back to this thread with questions), but first I would like to know if my approach, using categorical regression, is the right way to answer that question.

Best,
Thorsten

martinmodrak · June 15, 2021, 4:23pm

There’s IMHO nothing wrong with starting by a simple categorical (or more generally multinomial) regression. This can give you an idea how well you can predict one variable from the other, but I don’t think “covariance” is a good term here as the implies the variables are interval measures. In fact, there might be no single “covariance”: let’s say both variables (X1, X2) have three catgories (A,B,C).

The observed frequencies of all combinations are:

X1    X2    count
 A     A      100
 A     B        3
 A     C        4
 B     A        8
 B     B       49
 B     C       52
 C     A       27
 C     B       35
 C     C       28

Now X1 = A gives as very good information about X2 - it will most likely be A, while X1 = B gives us less (that X2 is unlikely to be A) and X1 = C doesn’t tell us much useful about X2 at all! So you can’t really summarise this well in a single number as you could do if the variables were interval or at least ordinal.

If you could elaborate a bit more on what is the underlying scientific/real-world question, then maybe we could see if there is a better approach.

thowagner · June 16, 2021, 10:12am

Hi Martin,

thanks for the response :-)

So what we are trying to do is the following: For our protein of interest, we have multiple sequences of the same protein. We would like to find out the whether two positions are related to each other (hence interacting with each other) by checking how often they co-vary among all the sequences.

I tried to simulate the problem. In the following code I assume that category 1 and 4 co-vary together.

# simulate observation of 500 category pairs
N <- 500
prob_cat_a <- c(1,1,2,1)
pa <- softmax(prob_cat_a[1], prob_cat_a[2], prob_cat_a[3], prob_cat_a[4])

cat_pos_a <- rep(NA,N)
set.seed(34302)
for ( i in 1:N ) cat_pos_a[i] <- sample( 1:4 , size = 1, prob=pa )

cat_pos_b <- rep(NA,N)
for ( i in 1:N ) {
    if (cat_pos_a[i] == 1 ) {
        prob_cat_b <- c(1,1,1,5)
    }
    else if ( cat_pos_a[i] == 4 ) {
        prob_cat_b <- c(5,1,1,1)
    } 
    else {
        prob_cat_b <- c(1,1,1,1)
    }
    
    pb <- softmax(prob_cat_b[1], prob_cat_b[2], prob_cat_b[3], prob_cat_b[4])
    cat_pos_b[i] <- sample( 1:4 , size = 1, prob=pb )
}

I thought I start with indicator variables for all 4 categories, because thats easier for me to understand. Once I got the model running, I want to switch to index variables to make it more flexible. Anyway, the model is not running, here is what I did so far:

code <- "
data{
    int N; // number of pairs
    int K; // number of possible categories
    int cat_pos_a[N]; // outcome
    int posb_1[N]; // predictor indicator for class 1
    int posb_2[N]; // predictor indicator for class 2
    int posb_3[N]; // predictor indicator for class 3
    int posb_4[N]; // predictor indicator for class 4
}

parameters {
    vector[K-1] a; // intercepts
    vector[K-1] b1;
    vector[K-1] b2;
    vector[K-1] b3;
    vector[K-1] b4;

}

model {
    vector[K] p;
    vector[K] s;
    a ~ normal(0, 1);
    b1 ~ normal(0, 5);
    b2 ~ normal(0, 5);
    b3 ~ normal(0, 5);
    b4 ~ normal(0, 5);

    s[1] = a[1] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;
    s[2] = a[2] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;
    s[3] = a[3] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;
    s[4] = 0; // Pivot
    p = softmax( s );
    cat_pos_a ~ categorical( p );
}
"
"

This is how I run it:

dat_list <- list(N=N, 
                 K=4, 
                 cat_pos_a=cat_pos_a, 
                 posb_1=ifelse ( cat_pos_b==1, 1, 0), 
                 posb_2=ifelse ( cat_pos_b==2, 1, 0),
                 posb_3=ifelse ( cat_pos_b==3, 1, 0),
                 posb_4=ifelse ( cat_pos_b==4, 1, 0)
                )
m <- stan(model_code=code, data=dat_list, chains=4)

And this is what I get:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

No matches for: 

  vector * int[]

Available argument signatures for operator*:

  real * real
  vector * real
  row vector * real
  matrix * real
  row vector * vector
  vector * row vector
  matrix * vector
  row vector * matrix
  matrix * matrix
  real * vector
  real * row vector
  real * matrix

No matches for: 

  real + ill formed

Available argument signatures for operator+:

  int + int
  real + real
  vector + vector
  row vector + row vector
  matrix + matrix
  vector + real
  row vector + real
  matrix + real
  real + vector
  real + row vector
  real + matrix
  +int
  +real
  +vector
  +row vector
  +matrix

No matches for: 

  vector * int[]

Available argument signatures for operator*:

  real * real
  vector * real
  row vector * real
  matrix * real
  row vector * vector
  vector * row vector
  matrix * vector
  row vector * matrix
  matrix * matrix
  real * vector
  real * row vector
  real * matrix

No matches for: 

  ill formed + ill formed

Available argument signatures for operator+:

  int + int
  real + real
  vector + vector
  row vector + row vector
  matrix + matrix
  vector + real
  row vector + real
  matrix + real
  real + vector
  real + row vector
  real + matrix
  +int
  +real
  +vector
  +row vector
  +matrix

No matches for: 

  vector * int[]

Available argument signatures for operator*:

  real * real
  vector * real
  row vector * real
  matrix * real
  row vector * vector
  vector * row vector
  matrix * vector
  row vector * matrix
  matrix * matrix
  real * vector
  real * row vector
  real * matrix

No matches for: 

  ill formed + ill formed

Available argument signatures for operator+:

  int + int
  real + real
  vector + vector
  row vector + row vector
  matrix + matrix
  vector + real
  row vector + real
  matrix + real
  real + vector
  real + row vector
  real + matrix
  +int
  +real
  +vector
  +row vector
  +matrix

No matches for: 

  vector * int[]

Available argument signatures for operator*:

  real * real
  vector * real
  row vector * real
  matrix * real
  row vector * vector
  vector * row vector
  matrix * vector
  row vector * matrix
  matrix * matrix
  real * vector
  real * row vector
  real * matrix

No matches for: 

  ill formed + ill formed

Available argument signatures for operator+:

  int + int
  real + real
  vector + vector
  row vector + row vector
  matrix + matrix
  vector + real
  row vector + real
  matrix + real
  real + vector
  real + row vector
  real + matrix
  +int
  +real
  +vector
  +row vector
  +matrix

Expression is ill formed
  error in 'model1bc53051cc07_b9dbde3639292f53cacf9bde5d2a0691' at line 30, column 63
  -------------------------------------------------
    28:     b4 ~ normal(0, 5);
    29: 
    30:     s[1] = a[1] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;
                                                                      ^
    31:     s[2] = a[2] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;
  -------------------------------------------------

I’m looking forward to your comments. It seems that I really doing something wrong syntax wise.

As stan novice I really appreciate your willingness to help me :-)

Best,
Thorsten

martinmodrak · June 17, 2021, 11:57am

Hi,
the way you simulated the data would IMHO correspond to a categorical regression. I however think that co-evolution of positions in protein sequences has additional structure beyond that, although categorical regression might be a start.

If you are working with DNA sequence of the protein (which I guess you do as you have 4 categories) you need to take into account that some substitutions are synonymous and don’t actually change to final protein. Wouldn’t working with amino-acid sequence be preferable for this use case?

I only have very brief understanding of the field, but some stuff I would expect to matter:

that co-occurent substitutions are bigger evidence for interactions if the substitutions preserve the (lack of) electrostatic attraction/repulsion
I guess not all of your sequences have exactly the same length, so some sort of multiple-sequence alignment had to be performed before, right? This would also probably convey additional information that could be used.

I am a bit more familiar with some work on RNAs - are your goals similar to the “covariance models” used for mulitple-sequence alignments in RNAs? (e.g. https://academic.oup.com/nar/article/22/11/2079/2400118?login=true) Or are your goals more similar to inferring something about structure from co-evolution of positions e.g. as in https://academic.oup.com/bioinformatics/article/36/10/3072/5729989 ? Or something else?

To your original syntax problem - as the message says, stan does not support multiplications between vectors and arrays. This is because there are two multiplications that could make sense (element-wise product and inner product). To make sure an element-wise product is taken, you need to convert the int array to vector and then use the “element-wise multiplication” operator .*, e.g.

b1 .* to_vector(posb_1)

Alternatively, you may want to declare posb_1 already as vector in the data block and avoid repeating the conversion.

Hope that helps at least a little.

simonbrauer · June 17, 2021, 7:36pm

I suspect that you intend for each observation (i.e. each of N) to have their own vector of K probabilities. However, you specify vector[K] p;, which only has K entries.

Note too that it looks like you’re trying to assign a vector to a single value (barring the int * vector issue that @martinmodrak noted). In the code below, s[1] is a single element of a vector while b1 * posb_1 is a vector multiplied by a vector/int. The dimensions are off in this case, but you obviously can’t assign a longer vector to a single value. (I might add that using the correct data type was one of the biggest hurdles I had when learning Stan, so this sort of error looks very familiar to me!)

s[1] = a[1] + b1*posb_1 + b2*posb_2 + b3*posb_3 + b4*posb_4;

I’ve edited your model block to align with this specification. Note that this isn’t very efficient, but it is the closest way of doing it that maintains your existing code structure.

vector[K] p[N];
vector[K] s[N];
a ~ normal(0, 1);
b1 ~ normal(0, 5);
b2 ~ normal(0, 5);
b3 ~ normal(0, 5);
b4 ~ normal(0, 5);

for(n in 1:N){
    s[n][1] = a[1] + b1[1] * posb_1[n] + b2[1] * posb_2[n] + b3[1] * posb_3[n] + b4[1] * posb_4[n];
    s[n][2] = a[2] + b1[2] * posb_1[n] + b2[2] * posb_2[n] + b3[2] * posb_3[n] + b4[2] * posb_4[n];
    s[n][3] = a[3] + b1[3] * posb_1[n] + b2[3] * posb_2[n] + b3[3] * posb_3[n] + b4[3] * posb_4[n];
    s[n][4] = 0; // Pivot
    p[n] = softmax( s[n] );
    cat_pos_a[n] ~ categorical( p[n] );
}

thowagner · June 18, 2021, 6:17am

Thanks @martinmodrak and @simonbrauer ! That’s super helpful.

I will soon get back to that project (late today or early next week) and give you an extensive answer / feedback.

Thanks again!
Thorsten

thowagner · July 15, 2021, 2:19pm

Hi again!

I finally found the time to work on the problem. First of all, the model is working now. So thanks again @martinmodrak and @simonbrauer . However, now I struggle with interpreting the results. Here is jupyther notebook with my results:

gist.github.com

https://gist.github.com/thorstenwagner/3d95f0bdf73e5da516d44079f2a5d676

catreg.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(rethinking)"
   ]

This file has been truncated. show original

I read that the safest way to interpret the model is to compute the implied predictions. I extracted samples from the posterior and tried to get back the categorical distributions that I used during simulations (see Case A and Case B). However, its not what I was expecting.

For Case A: Why I don’t see an equal distribution here?

For Case B: Why are the probabilities for category 1 that low?

I hope you can help me to clarify this.

Best,
Thorsten

thowagner · July 15, 2021, 2:40pm

Found the problem, the simulation was the wrong way around! Here is now the results, and the results are as expected:

gist.github.com

https://gist.github.com/thorstenwagner/4b69a797b8c1508eb2ee07dc9763f9a4

catreg.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(rethinking)"
   ]

This file has been truncated. show original

Thanks you very much again!

Topic		Replies	Views
Labeling and order of multinomial regression coefficients Modeling	18	711	November 29, 2020
Help with bayesian regression modeling (beginner) Modeling rstan , specification , brms	24	727	February 19, 2024
Can a multinomial regression be condensed? Modeling	7	331	March 31, 2024
Multinomial Logit: probability of choice of a soccer action Modeling pystan , multinomial-response , logistic-regression	24	1986	March 9, 2021
Error-in-variables regression with unobserved discrete predictors Modeling specification	11	1221	March 2, 2018

Categorical Regression

Related topics