Constraining predictions of matrices to sums of categories of rows and columns

Hello everyone,

I am completely new to the community and a novice in running Bayesian models. My question pertains to a set of matrices that I have collected and standardized from across many countries and years. Each matrix represents the total health spending in a given country year and is broken-down by the two dimensions - health spending by types of providers (~20 columns) and health spending by types of services (~20 rows). The sum of the rows must equal the sum of the columns. These matrices have varying levels of completeness (ranging form only totals - where cross-tabulations are completely empty - to nearly complete cross-tabulations).

Using the rstanarm package, I am currently able to predict cell-specific estimates across the countries and years, however, I am not sure how to incorporate the constraints that the sum of the rows must equal the sum of the columns. Has anyone added similar constraints to their models?

Thanks so much,
Matt

You would have to write your own Stan code to enforce a constraint such as that in the log-likelihood.

Thanks for the response, Ben. If I wanted to write my own Stan code that would allow for this constraint as well as the functionality from stan_glmer would you have some recommendations for starting places for me (example code, etc)?

In the end I will want to have country and time effects (possibly random effects) in addition to covariates such as gross domestic product.

I would start with brms::make_stancode(formula, data). I am not totally sure how you want to do your model, but you could have the conditional mean of the rows be a function of the covariates and the conditional mean of the columns be a scaled simplex where the scaling factor is the sum of the conditional means for each row.

Thanks, Ben. This sounds really promising. I’ve been reviewing brms::make_stancode(formula, data) and the simplex {boot} function in R. However, not having a strong CS or math background (mostly biostats) this leaves me with many questions. I wonder with the below example of my previous model and data could you add a bit more detail?

Variable descriptions:
country = country name
c_num = numeric country value
year = year of data
sha_hp = health provider category
sha_hp_num = numeric health provider category
sha_hc = health function (services) category
sha_hc_num = numeric health function (services) category
ln_gdp_pc = logged GDP per capita (country-year specific)
value_pc = health expenditures (USD) in per captia space of specific country-year provider-function

My previous rstanarm model used was:

model_1 <- ln_value_pc ~ ln_gdp_pc + (1|c_num) + (1|sha_hc_num) + (1|sha_hp_num)

stan_glmer(formula = model_1, data)

The ppc distributions looked like this:

ppc

The images are missing

Sorry about that Ben. I’m trying to figure out why they didn’t and how I can get them to upload.

Okay, let me know if you can see them now.

OK, I can see them now. With these variables, what is supposed to equal the sum of something else?

The data head I showed is in the long format. So for a given country-year (Afghanistan 2008), there will be expenditures reported by health providers (sha_hp) which would sum across all variants of health functions (sha_hc). So the category sha_hp = 1 would have expenditure values across sha_hc = 1 to 9; the sum of these would equal the total for sha_hp = 1. The same would hold for any category of sha_hp, summing across the categories of sha_hc. Then the sum of the totals of sha_hp’s must equal the sum of total sha_hc’s, which is the total health expenditures in that given country-year.

I hope this makes sense, but if not, the original data that was compiled is a table (or matrix) for a specific country-year where rows are the health functions, the columns are the health providers, and the margins are the totals, which sum to the total health expenditure.

Hi Ben,

Just wanted to follow up for any thoughts you might have with the information I’ve provided about my data? Particularly with your point below:

start with brms::make_stancode(formula, data) . I am not totally sure how you want to do your model, but you could have the conditional mean of the rows be a function of the covariates and the conditional mean of the columns be a scaled simplex where the scaling factor is the sum of the conditional means for each row.

So, in the .png you put up above, you have some rows of data in the long format that are sums of the other rows on health expenditures?

Currently, yes. However, I can reformat the data in whatever way works. Perhaps showing you an example of one of the raw matrices will help. Below you will see that the columns are health care providers and rows are health care functions, please ignore the sub-components of the categories (i.e. HC 1.1, HP 1.1) as I am only worried about making estimates are the first level (i.e. HC 1 and HP 1). The sum of the columns, equal the sum of the rows (1,277,281 million). I have, roughly 800 country-years worth of data with matrices with similar detail and roughly 600 where we only have the sums.