The observed values are x_ij and y_ij, which represent measurements from two data modalities. Cell size j is also known and used as offset. All the rest are unknown and need to be estimated. Based on my limited knowledge, I know how to fit a model like the one below:

But this model lacks some flexibility. For example, in the second model, when x_ij=0, y_ij will be 0 for sure. However, in real data, there are cases where y_ij>0 when x_ij=0. This is why I added some additional layers to it and come up with the first model. I can provide more explanations of the additional layers but I have a few general questions:

(1) When writing a model, we can use as many layers as we want as long as we think that is the data generative process. However, how do we know if the parameters in the model can be estimated, especially using Stan?
(2) Is there anywhere to find Stan code for complex multilevel models? Iâd like to learn from real examples on how to actually specify and fit this kind of models. Most of the models I found have at most 3 layers like the second model I posted here. I am specifically interested in models with layers that contains indicator functions to provide a mixture of two processes.
(3) Can the first model be implemented in brms, or I have to use rstan directly?

Thatâs usually a very hard question. There are some theoretical results, but for the most part, you need to verify your ability to fit the model empirically (i.e. try and see if it works). I would generally recommend a workflow following [2011.01808] Bayesian Workflow where you start with a simple model and gradually add complexity as needed. Building a huge model at once is highly likely to introduce bugs. Inability to fit the model given the data you have will manifests as divergences or other warnings when fitting the model and/or posterior uncertainty that is too wide to be useful.

Thank you very much for your reply! Everything makes a lot of sense. I have been trying to implement it using Stan and marginalization seems to work as expected.

I just have one more question regarding the model. The reason why I formulate the model in this way is because I want to be able to generate non-zero y_ij given zero x_ij (the simpler model will have y_ij=0 for sure if x_ij=0. The indicator function layer (4th and 5th row of the more complex model) can be simply considered as to learn a probability of generating non-zero y_ij given zero x_ij. So in order to fit this model, training data should contain data points where x_ij=0 but y_ij>0. However, Stan is based on log probability. When x_ij=0, I will have log(0) in the log probability and this will cause initialization errors. So I was just wondering if there is a way to still keep these data points (x_ij=0 but y_ij>0) and use them for model training without causing initialization errors? Something like adding a very small value to avoid log(0)?

If I understand the model correctly, I think the problem is unrelated to Stan using log probability - you are using \ln(x_{ij}) as a predictor and thus implicitly assume x_{ij} > 0 (regardless of how you fit the model). There is a sensible extension to the situation that you allow x_{ij} = 0 and interpret it as \ln(\mu_{ij}) = -\infty and thus \mu_{ij} = 0 which implies y_{ij} = 0. If you indeed want to allow other behavoir, it implies your modelling assumptions are not consistent with your domain knowledge and you need to modify the model in some way. Adding small number to x_{ij} is a hack (which might be OK in some cases), but there might be better options.

An example on what this might mean: Presumably, x_{ij} are counts, potentially obtained by sequencing. If this is the case, you are probably using x_{ij} as a proxy for the unobserved expression level in the cell - where the expression level (not the sequenced molecule count) is the predictor you actually want to use. In such situation, you may benefit from treating this as a measurement error model - you would introduce additional parameters \lambda_{ij} and have something like x_{ij} \sim NB(\exp(\lambda_{ij}), \phi_{i}). You would then have \ln(\mu_{ij}) = \alpha + \alpha_i + \lambda_{ij} + \ln(cellsize_j). Such models are computationally expensive, but are theoretically appealing - if computation is an issue you may than try to figure out an efficient way to approximate this theoretically appealing model.

There is a sensible extension to the situation that you allow x_{ij}=0 and interpret it as ln(ÎŒ_{ij})=ââ and thus ÎŒ_{ij}=0 which implies y_{ij}=0.

This is exactly what I want the model to behave. But when I include data point with x_{ij}=0, the sampler wonât work because there will be log probability equal to log(0). It seems that there is not a way to walk around it given the current model.

Regarding the model you proposed, I think this is a great suggestion. But it is a bit tricky in my situation. I would love to hear your advice. This is not really related to Stan in specific so I donât want to take too much of your time. But if you have time to take a look, any suggestion is appreciated! Here is what I am trying to do:

x and y are actually measurement of the same set of genesâ expression using two technologies (letâs call them A and B). I am trying to predict y from x because in practice, it is a lot easier to get x using technology B than y using technology A. In my current model, x_{ij} is actually treated as the true expression of gene i in cell j, i.e., the \lambda_{ij} you proposed. It is a ratio between 0 and 1 instead of count (y is count). However, because x_{ij} is not \lambda_{ij}, I am adding all the additional layers to kind of try to guess the \mu_{ij} generated by an unknown \lambda_{ij} from the \mu^{T}_{ij} generated by x_{ij}. But I think your strategy is more clear in terms of logic. Basically, there are two measurement error model: x and y (now we consider both x and y are count data) are generated in similar ways from a common hidden variable representing true gene expression level, i.e., \lambda_{ij}, separately. However, to be more precise, the two technologies are not carried out on the same set of cells. So x and y actually have different number of cells and there is no one-to-one match between them. I am using average gene expression of cells from the same cell type to bridge the two datasets. This way, the common hidden variable actually represents true gene expression level of gene i in cell type k. Given all these information, I wrote a model like this:

\lambda_{ik} is the hidden variable that generates both x and y through similar generating process with different parameters. Does this model sound reasonable to you? If yes, then my question is, is it really possible to fit this model? Basically, is it possible to estimate all the hidden variables given only x, y and the cell size information? This doesnât seem to be the typical kind of models that I have seen before. Eventually my goal is to have a model that can predict y just from x.

Indeed, infinities almost always cause problems in Stan models. In this specific case however, the data points with x_{ij} = 0 cannot possibly provide any useful information for the model and can thus be ignored. The easy way to see that is that if x_{ij} = 0, then P(y_{ij} = 0) = 1regardless of all other model parameters. More precisely: if x_{ij} = 0 and y_{ij} = 0 the log likelikehood for y_{ij} is zero and the gradient of the likelihood wrt. all parameters is also zero, so the posterior computed by Stan will be the same whether we include or donât include those data points. If however x_{ij} = 0 and y_{ij} \neq 0, the log likelihood is negative infinity (i.e. the model is contradicted by the data), once again regardless of all other model parameters.

If I read the model correctly, it says that if cell size is fixed, then for a given gene, all cell types exhibit the same multiplicative difference between the average expression measured by A and average expression measured by B. The model IMHO could be rephrased as a simple linear model with predictors for both mean and dispersion (basically \lambda_{ik} are âfixed effectsâ of gene x cell type interaction, \alpha^A is the âfixed effectâ for technology A, \alpha_i^{A,B} are âfixed effectsâ of technology x gene interaction).

This might and might not be a good model of the actual process and you would need to verify that by using theoretical background and examining the fit to data. I donât know which technologies you compare, but to me, your focus on gene-specific differences between the technologies seems somewhat implausible. I would a priori expect most of the differences between the technologies to be associated with the actual expression level (e.g. the technologies likely have different lower limit for detection; some technologies like fluorescence might âsaturateâ and have reduced resolution for very high expression levels; etc.), the length of the transcript and possibly some aggregate characteristics like GC content. There might still be gene-specific differences once you factor that in, but I would expect those to be mostly small.

So I think you might be better served by approaching the problem more like an attempt to find a calibration curve (that might somewhat differ with properties of the gene). But you obviously understand the problem and relevant background knowledge better, so I might be missing some important consideration.

This actually doesnât look like a particularly terrible problem - itâs just a large linear regression. Since you treat the \lambda_{ik} as fixed effects, the main consideration is the number of replicates per cell type and technology. If it is sufficient (in combination with your priors) to inform all the model parameters, I donât see any reason why this model would be too problematic as long as the number of genes is no too high (a quick rough guess: with hundreds of genes, this could still be relatively quick, with low thousands of genes it would take couple hours/days but still be feasible)

Indeed, infinities almost always cause problems in Stan models. In this specific case however, the data points with xij=0 cannot possibly provide any useful information for the model and can thus be ignored. The easy way to see that is that if x_{ij}=0 , then P(y_{ij}=0)=1regardless of all other model parameters . More precisely: if x_{ij}=0 and y_{ij}=0 the log likelikehood for yij is zero and the gradient of the likelihood wrt. all parameters is also zero, so the posterior computed by Stan will be the same whether we include or donât include those data points. If however x_{ij}=0 and y_{ij}â 0 , the log likelihood is negative infinity (i.e. the model is contradicted by the data), once again regardless of all other model parameters.

Thanks for this super clear explanation! Now I understand what is going on when I have x_{ij}=0 now!

If I read the model correctly, it says that if cell size is fixed, then for a given gene, all cell types exhibit the same multiplicative difference between the average expression measured by A and average expression measured by B. The model IMHO could be rephrased as a simple linear model with predictors for both mean and dispersion (basically \lambda_{ik} are âfixed effectsâ of gene x cell type interaction, \alpha^{A} is the âfixed effectâ for technology A, \alpha^{A,B}_{i} are âfixed effectsâ of technology x gene interaction).

I think your understanding is correct. What the model is trying to say is that there is a true level of expression for each gene in each cell type. What we observed by different technologies is dependent on this true value, the sequencing depth (for sequencing based method), and an additional layer of gene specific âdistortionâ. The âdistortionâ here is represented by a gene specific multiplicative deviation from the true expression level, which is the \alpha^{A,B}_{i} in the model. I kind of understand what you mean by a simple linear model conceptually but not quite sure of how to rephrase the model exactly. I have a feeling that this is important for my understanding of the current model. Please forgive my ignorance. Do you mean something like this: \mu=\alpha+\beta_{1} \times gene + \beta_{2} \times technology + \beta_{3} \times gene \times cell.type + \beta_{3} \times gene \times technology + \epsilon ?

In my specific case, the two technologies are single cell sequencing and single molecule FISH. Based on my knowledge, gene specific multiplicative difference is what people use. But you are absolutely right that there are many other factors. In fact, the factors you mentioned probably all play a role in this technology difference. I have checked GC content and gene length and their effect is minor but gene expression is definitely the most important factor. That is why I have it in all the different versions of the model. But when you say âmost of the differences betweenthe technologies to be associated with the actual expression levelâ, do you mean having another layer that links \alpha_{i}^{A,B} with average expression of gene i? A more general question is that, we donât have a clear understanding of how exactly gene expression quantification differs between technologies. So I am kind of lost in a sense that I donât know if I should keep adding more layers to the model to make it more complicated or should I ignore certain thingsâŠ

In the real data, when I compare the two datasets from A and B, they donât agree very well, meaning that for two genes in two cell types, if there expression levels are similar based on technology A, their expression levels in technology B can be similar or very different. The variation is actually quite high. This is why I am trying the Bayesian model to measure the uncertainty (posterior distribution of the \sigma_{platform}). If I understand correctly, linear regression or calibration curve doesnât have this feature?

This actually doesnât look like a particularly terrible problem - itâs just a large linear regression. Since you treat the Î»ik as fixed effects, the main consideration is the number of replicates per cell type and technology. If it is sufficient (in combination with your priors) to inform all the model parameters, I donât see any reason why this model would be too problematic as long as the number of genes is no too high (a quick rough guess: with hundreds of genes, this could still be relatively quick, with low thousands of genes it would take couple hours/days but still be feasible)

I guess the reason why I ask if this model is easy to fit or not is because in most of the models that I see, the predictors (x_{ij} in my model) are always on the right hand side. But in the last model, it is on the left hand side and it is assumed to be randomly sampled from a negative binomial distribution. I guess when I have x_{ij} and y_{ij} and just want to fit the model to estimate all the hidden variables, I should be able to do that. However, I am not sure how it would work for generating new data. Specifically, like I mentioned earlier, the two technologies are single cell sequencing and single molecule FISH. In reality, single molecule FISH can only measure fewer than 100 genes. One goal of fitting a model like this is that I can learn the general rules connecting the two data modalities by fitting a generative model using these 100 genes and then use this model to predict the gene expression of other genes in single molecule FISH based on their expression in single cell sequencing. This means that the input for generating data will be different from the set of genes used to training the model. For this generative process, x_{ij} and cell size are the only input. In the old models (the two models in my first post), I can simply plug in these two factors and calculate \mu_{ij} and \theta_{ij} and then draw samples from the ZINB distribution (\alpha_{i} is randomly drawn from the estimated normal distribution). But in the last model, because x_{ij} is on the left hand side, how is it used to generate y_{ij}? It seems that \mu_{ik}^{B} and \theta_{ik}^{B} needs to be estimated somehow from x_{ij} first? After having \mu_{ik}^{B} and \theta_{ik}^{B}, I can see how to get \mu_{ik}^{A} and \theta_{ik}^{A} (actually we donât even need to know \lambda_{ik} because it is cancelled out) but if I need to estimate \mu_{ik}^{B} and \theta_{ik}^{B} every time when I make a prediction, then it doesnât seem correct. Prediction should not involve any estimation right? This means I must have misunderstood some parts. This model is the most logically clear to me and I can keep all the data with x_{ij}=0. However, in practice, I will use this model to generate data many many times so if an estimation is needed, I wonât be able to use this model or models with similar logic (treating both x and y as count generated from the same true expression value) thenâŠ

I think this potentially reveals a fundamental problem which should be resolved before delving into details on model implementation. Iâve never analysed FISH data and have only superficial understanding of the method, so please correct me if I am wrong, but my understanding is that FISH will give you absolute quantification. On the other hand, scRNA-seq is going to give you relative quantification.

For a specific example, letâs imagine you measure twice as many cells, but keep the experiment otherwise the same. Since the scRNA-seq reads would split over twice as many cells, youâll get on average twice lower read counts per cell. On the other hand, FISH should report the same expression levels in all cells. Or letâs imagine that you have the same number of cells, but all cells increase their metabolism and on average all genes are expressed 1.5 times more (or have the same expression but you do a better sample preparation, so more RNA is retained): in this case, youâll get roughly the same read counts as the sequencing depth was kept the same, but FISH should show the increase. Does that sound correct?

If I am right, than it is not impossible that the discrepancies you observe are primarily driven by this relative vs. absolute mismatch. Or do you already correct for it in some way? Unless you have some good way to anchor the scRNA-seq data and thus make it possible to get absolute quantification (e.g. via some sort of spike-in at cell level), you need to first make both datasets measure the expression relative to the same baseline. One could probably use the total counts of all genes you have FISH data for to normalize the scRNAseq data (instead of normalizing against all genes which is usually the default) and then also normalize the FISH data by the total expression observed in FISH.

In general, compared to scRNA-seq, FISH data is more on the absolute side. However, based on my knowledge, its quantification can still be limited by technical issues like optical crowding (if there are too many cells or RNA molecules, the fluorescence signals will pile up, which will affect the segmentation quantification). Due to this reason, people usually donât use highly expressed genes in FISH because they will dominate the signals. So in the extreme case, the number of detected RNA molecules may not go linearly with the actual level of gene expression but will saturate. I am not sure if this will make the FISH quantification not absolute.

In my specific case, when I say the quantification from the two technologies donât agree with each other, what I compared is relative expression instead of the absolute number of count. And the relative expression is calculated using the same set of genes. For example, if I have 200 genes measured by both technologies, then the relative expression is the number of counts/molecules of one gene divided by the total number of counts/molecules from the 200 genes. Therefore, this should not be driven by the relative vs. absolute mismatch. Basically, if we go back to your example, if we measure twice as many cells, the total number of read counts per cell using scRNA-seq will be half but the relative expression of each gene will stay the same if there is no change in the experiment condition. In my last model, I kind of consider this relative expression of a gene in a cell type as the hidden true expression level (\lambda_{ik}). And the relative expression from scRNA-seq data and FISH data is a distorted version given the hidden true expression level (e^{\alpha_{i}^{B}}\times\lambda_{ik} and e^{\alpha^{A}+\alpha_{i}^{A}}\times\lambda_{ik}). These two ratios are multiplied by the cell size variable to get the mean expression level (\mu_{ik}^{A} and \mu_{ik}^{B}). The cell size variable for scRNA-seq basically represents the sequencing depth and the cell size variable for FISH represents an empirical number of molecules observed for each cell. Based on my understanding, by explicitly modeling the data generating process for the two technologies in this way starting from the same true relative expression level, we should be good for the relative vs. absolute problem you mentioned. I am not sure if this sounds reasonable to you. I feel like this is the most logical way to me but as I asked in my previous reply, once the model is fitted, making predictions from new x seems to be a non-trivial thing. It seems that \mu_{ik}^{B} and \theta_{ik}^{B} needs be estimated from new x first before making predictions. Did I misunderstand this process?

Great that you already took care of this, I just wanted to make sure we are not missing a huge issue before delving into smaller details.

There is a potential for further (IMHO minor) complication in that for N genes, relative expression has only N - 1 degrees of freedom and there as in implied negative correlation structure across genes (increase in one relative expression must necessarily reduce the relative expression of another), so one may want to model the outcomes as something based on multinomial distribution (for counts) or dirichlet distribution (for continuous values). This adds an additional layer of complexity, but with N = 200, I think that the implied negative correlations are not such a big deal and can be mostly ignored (I did some inconclusive initial experiments in that direction and couldnât get noticeable improvement from using multinomial models for RNA-seq data). But just wanted to put that out there :-) You may also consider modelling the true relative expression as a simplex.

No worries, this stuff cen be hard! Roughly yes. Just to be very specific, if I start with this part of your model:

This is definitely a linear model! If we ignore priors, and reshuffle the coding and coefficients a bit, this is IMHO identical to R formula ~ 0 + technology + gene:technology + gene:cell_type where the technology term corresponds to (1-T)\alpha^A, gene:technology to (1-T)\alpha_i^A + T\alpha_i^B and gene:cell_type to \ln \lambda_{ik}. The cell size is just an offset term, so not part of the formula proper.

With that said, the lack of intercept and a richer model similar to what you wrote might be more sensible.

Does that make sense?

I didnât really mean more layers, just having a more flexible than linear relationship between the two relative expression levels, especially in the light of:

So instead of assuming a linear relationship, you may want to fit something like a spline, to allow for something like

or any other weird relationship there might be (this is one possible âcalibration curveâ Iâve been speaking about).

Just to be clear, calibration curves and linear regression are just types of models one can fit in either frequentist or Bayesian paradigm. So designing a model and then doing inference are two task that need not be done together (although choice of paradigm and software limits which models you can do sensible inference for).

Thatâs an important point. Indeed, the model as I proposed does not leave you with a direct way to plug in data from one technology and predict the other. One would actually need to do estimation for predction. However, there is actually no rule that says âprediction should not involve estimationâ, so weâre good :-) I admit I have only some theoretical knowledge about this types of problems and have never implemented it myself, so please take some caution with the following advice.

The safest way to do this type of prediction is AFAIK actually to fit a model that includes both your training data and then the single technology data you want to predict for - from such a big model, you should be able to make predictions for the other technology directly. There are however many reasons (in particular because this may be computationally were expensive) why this might not be practical and you may want to take shortcuts. But even when you want to take shortcutes, I find it helpful to keep this ideal case in mind and try to find some approximations to this.

To be specific, one could fit for the training data once and then only do some simple fit per gene in the new data, to get a distribution of plausible \lambda values, and then combine those with the posterior from training data.

One may even go a step further and take the data from one technology directly as predictors (ignoring the measurement error, replacing zeros with small numbers) as you suggested in your initial model. However to approximate the behaviour of the ideal model, we should keep in mind, that we expect greater posterior uncertainty about FISH expression for low read counts (as those constrain \lambda much less) than for high counts. So if you do this, you should definitely allow the ânoiseâ in the response (FISH data) depend on the read count!

Additional reason why some sort of inference might be needed in the prediction step is that you actually donât model the relationship between expression measured by two technologies in the same cell, but between set of expression levels measured in the same cell type. I.e. to match your model, you cannot plug a single read count - instead you IMHO should take a set of replicates of RNA-seq and use those to infer the âaverage RNA-seq expression in this cell typeâ and then predict either âaverage FISH expresion in this cell typeâ or âset of replicates of FISH measurements in this cell typeâ.

Just to make sure I understand this correctly, do you suggest to model the count data using a multinomial distribution rather than negative binomial distribution? Also what do you mean exactly by âmodelling the true relative expression as a simplexâ?

I see. Now I think I understand what you mean by a linear model. My brain is getting slow right now so this may be a stupid question. Given this linear regression formulation, do you think I could fit the whole thing using linear regression rather than going through a full Bayesian framework?

Maybe the fit is not good. The estimated parameters eventually give a linear shape instead of a S shape. I guess the non-linear relationship is not so strong in the data although theoretically it should be there. Or the estimation is messed up by other stuff. I will look into this more.

I have actually fitted the last model I proposed in this post (the one you said can be reformulated in a linear regression fashion). Surprisingly, the model fit is really good. The posterior predictive check looks good. All other aspects that I care about are also very good. So this seems like a good model fit on the training data. However, when I think about making predictions, I realized it is harder than I thought. Basically during model training, what is used are the genes that show up in both datasets and cell types show up in both datasets. However, when I make predictions, the predictions will be made on new genes. These genes donât have spatial measurement (i.e., y_{ij} in the model). So I cannot even fit the current model, let along making predictions. One compromise I can take is to just fit the second half of the model using scRNA-seq counts of new genes (i.e., x_{ij}) to estimate \lambda_ik first:

However, without constraints from the first half of the model, the estimated \lambda_{ik} using just the second half of the model can be very different from the estimates using the full models (if we have their corresponding y_{ij}). For example, for one gene in one cell type, assume I have 1000 cells in scRNA-seq and they all have 0 count. But I have 200 cells from FISH data for the same gene in the same cell type that they all have non-zero count due to high sensitivity. If I am fitting just second half of the model, the most likely \mu_{ik}^{B} is 0, which makes \lambda_{ik}=0. However, if I fit the full model, because the FISH measurement is non-zero, these non-zero y_{ij} will force the estimated \lambda_{ik} to be non-zero because it is shared in the first and second half of the model. In real world situations, when I make predictions, I only have x_ij values that are not seen in the training data. Do you know if there is any way to robustly estimate their \lambda_{ik}? If I can do that, then I can predict their corresponding y_ij by taking their estimated \lambda_{ik} and plug them into the first half of the model. I guess this is the most important question to me now. If I can do it, everything will be solvedâŠ

I am currently taking care of this by requiring the dispersion parameter to be dependent on the expression:

Higher expression will lead to lower dispersion (small additional variance). Do you have suggestions on other ways to control for this? I still see quite some variation when the expression is high. This may be due to the fact that the random effect is NOT dependent on gene expression so that highly expressed genes may still have high random effect that is not supposed to happen.

Yes. In my current model, only the count are cell based. All the other parameters are either gene based (like \alpha_{i}) or gene and cell type based (like \lambda_{ik}). I am using these cell type average value as predictions.

Linear regression is a family of models. For any given model, there are many ways to fit it, including frequentist (often - but not always - maximum likelihood) and Bayesian and various approximations to either type. If the linear regression seems like a good model, it might be the case that say a maximum likelihood estimator proves sufficient (this tends to be the case when you have enough data to inform the model well, i.e. when the asymptotic properties motivating maximum likelihood methods can be relid on).

I wouldnât say I âsuggestâ using a multinomial distribution - I tried it and could not make it work well. I just wanted to note that theoretically the sequencing process is fundamentally multinomial. It might often be safe to ignore it, but one should be aware of what one is ignoring :-).

Modelling the relative expression as simplex, means keeping the negative binomial noise model, but remove one degree of freedom for the \lambda values. This could for example mean treating them as simplex (i.e. enforcing they sum to 1 for each sample).

I didnât want to suggest a specific curve shape (e.g. a sigmoid as the one you used), that was just an illustration, sorry for any confusion. I thing fitting a flexible curve family (a spline or a Gaussian process) is much more sensible - I donât think you can a priori assume the relationship is going to be sigmoid-like.

I think this really boils down to specifying an underlying model that matches reality well. In fact, all zero counts should generally just provide some sort of upper bound on expression, not make the respective \lambda counts zero.