Need help building a two-stage model (beginner)

I’m trying to model several teams competing in a group tournament. I’m fairly new to Stan, and I’m struggling to understand how to build a model beyond simple examples I could find.

Given: number of points won by each team by the end of the tournament (one point for win).
Estimate: relative strength of teams.

This is what I’ve got so far (using rstan package):

stan(
	model_code = "
		data {
			int num_teams;
			int points[num_teams];
		}

		transformed data {
			int outcome[num_teams, num_teams];
		}
		
		parameters {
			vector[num_teams] strength;
		}

		model {
			strength ~ normal(0, 1);

			for (i in 1:num_teams) {
				for (j in 1:num_teams) {
					outcome[i, j] ~ bernoulli_logit(strength[i] - strength[j]);
				}
			}

			for (i in 1:num_teams) {
				points[i] = 0;

				for (j in 1:num_teams) {
					if (i != j) points[i] = points[i] + outcome[i, j]
				}
			}
		}
	",
	data = list(
		num_teams = 4,
		points = c(6, 4, 2, 0)
	)
)

I’m getting this error message:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Cannot assign to variable outside of declaration block; left-hand-side variable origin=data
  error in 'model2294209945cd_ded447960a469e2e3ca50ae670299991' at line 25, column 27
  -------------------------------------------------
    23: 
    24: 			for (i in 1:num_teams) {
    25: 				points[i] = 0;
                                  ^
    26: 
  -------------------------------------------------

PARSER EXPECTED: <expression assignable to left-hand side>
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'ded447960a469e2e3ca50ae670299991' due to the above error.

Looks like I cannot assign values to whatever comes from data block. But how should I go about building this model then?

Could someone nudge me in the right direction, please?

maybe try:

for (i in 1:num_teams) {
  points[i] = 0;
    for (j in 1:num_teams) { 
       if (i != j) {
          points[i] = points[i] + outcome[i, j];
       }
    }
}

I just try to fix the syntax error and improving readability (my readability)… I don’t know if your model is correct or not…

Thanks, but this doesn’t seem to change the error message:

Cannot assign to variable outside of declaration block; left-hand-side variable origin=data

I cannot understand why you are changing the value of your input data (the variable point ) in the model block. what are you trying to do?
Moreover, in the transformed data block you are just declaring a matrix of int(4x4) without assigning any values to it.
So, on my PC, every value of the matrix elements is set by default (-2147483648)

I’m trying to move away from a simple model:

data { match_outcome }
parameters { team_strength }
model {
  team_strength ~ normal()
  match_outcome ~ bernoulli_logit(team_strength[i] - team_strength[j])
}

to something more sophisticated:

data { total_points }
parameters { team_strength }
model {
  strength ~ normal()
  match_outcome ~ bernoulli_logit(team_strength[i] - team_strength[j])
  total_points ~ sum(match_outcome)
}

I can’t understand how I’m supposed to add that extra step.

Stan requires a log density. It’s not clear from that sketch what log density your’e after, because it’s not clear where match_outcome is coming from. Is that observed or unknown? Also, ~ sum(...) isn’t a distribution. Is that an exact requirement?

Stan doesn’t allow discrete parameters, so if the match outcome is unknown, then it has to be marginalized out.

The “more sophisticated” model I’m trying to build should only have total_points in data block.

The model should output posterior distribution for strength parameter for each team.

The tricky part is that there’s no direct relationship (or, at least, none that I can easily model) between strength and total_points. Instead, I’ve got a relationship between strength and individual match outcomes, with total_points simply being sum of the latter (positive outcome = 1 point). This is what I tried to express as total_points ~ sum(match_outcome), which is, of course, not appropriate syntax.

Basically, I need a model that’s fitted to aggregate counts, and I can’t figure out how to insert that extra step.

Then you need to marginalize out the discrete parameters. That is, sum over all the ways the individual teams can make points that add up to the total if you only observe the total and there’s a latent discrete number of points for each team.

You could also model everything continuously. But you have to deal with variable bounds for different game contributions and there will be a lot of latent parameters.

I think it might help to write down the generative model for what you’re doing. How would you simulate from the model? I don’t understand why you’re counting points, and also modeling a logistic regression for match outcome (that’s basically a Bradley-Terry model in the middle).

Is there an example out there I could use? I’m not sure how to start implementing this in code.

I’ve read Chapter15. Latent Discrete Parameters of the reference manual, but it’s still unclear to me how to approach implementing my model.

I can help you do the marginalization, but you have to meet me in the middle by writing down the density for the model with the latent discrete parameters.

Sorry, I just don’t know how to do it. I don’t even know if it’s possible.

To belatedly answer your previous question:

Having estimated strength of each team (as mean of posterior distribution):

outcome <- matrix(
	rep(0, num_teams^2),
	nrow = num_teams,
	ncol = num_teams
)

for (i in 1:num_teams) {
	for (j in 1:i) {
		if (i != j) {
			# probability of team [i] beating team [j]
			win_probability <- exp(strength[i] - strength[j]) / (exp(strength[i] - strength[j]) + 1)
			
			# simulate outcome (1 point for win, 0 for loss)
			outcome[i, j] <- runif(1) < win_probability
			outcome[j, i] <- 1 - outcome[i, j]
		}
	}
}

points <- rep(0, num_teams)

for (i in 1:num_teams) {
	points[i] <- sum(outcome[i, ])
}

Does this make it a bit clearer what I’m trying to do? Is this something that can be easily modeled in Stan?

No, you can’t calculate means of posteriors within Stan. Stan calculates a log density and then the algorithms on top of Stan do things like sample from the posterior, then with a sample from the posterior, you can compute expectations like means.

I think what you’re trying to do is fit a Bradley-Terry model, which is a piece of cake in Stan. First,

inv_logit(u) = exp(u) / (1 + exp(u))

so you don’t need to do all that yourself.

Here are Stan implementations of Bradley-Terry for a case study.

In particular, you can start with this Stan program for Bradley-Terry that is completely self-contained.

1 Like