Modeling probability of action as a function of spatial distance

I’m trying to use rstan or brms to model how people interact with pins on a map, like the one below that shows pizza restaurants in the city of Chicago:

I’m envisioning people interacting with this map in terms of “moves”. The first pin they interact with is the first “move”, where they go next is the second “move”, and so on.

What I’d specifically like to model is (1) the probability that any given pin is chosen on the first move by participants and (2) given that a person chose a pin on the first move, the probability that any given pin is selected on the second move.

For this second question, I’m thinking that I need to somehow account for the spatial distances between pins. For example, if someone interacts with the map pin for Five Squared Pizza in the center of the map on move 1, I anticipate they are more likely to search close by pins on subsequent moves. Being able to account for spatial relationships between pins seems key to testing this hypothesis.

My question: Does anyone know of modeling strategies that can be helpful for answering these questions? I don’t really know where to start on the modeling side, and again, I’d like to move forward with modeling this data using rstan or brms.

I would really, really appreciate some help.

In terms of generic strategies, I think start small. So this was the first problem you described:

You could also simplify the problem to a single Pizza place vs. everyone else, so instead of selecting from a list, a yes-no choice.

If you can figure out things that are useful for the yes-no choice, then that’ll make your life easier when you go to the more complicated models.
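As a rough illustration of that simplification, here is a minimal Python sketch that recodes first-move choices into a 0/1 outcome for one target pin. The pin names other than Five Squared Pizza and the choices themselves are made up, and `to_binary` is a hypothetical helper, not something from rstan or brms. The resulting vector is the kind of outcome you could hand to a logistic regression (for instance a `bernoulli()` family in brms).

```python
def to_binary(first_choices, target):
    """Recode each first-move choice as 1 if it is the target pin, else 0."""
    return [1 if pin == target else 0 for pin in first_choices]


# Made-up first-move choices for four participants (illustration only).
first_choices = ["Five Squared Pizza", "Pin B", "Five Squared Pizza", "Pin C"]

y = to_binary(first_choices, "Five Squared Pizza")

# Raw share of participants whose first move was the target pin.
share = sum(y) / len(y)
```

Covariates that help predict this yes-no outcome (position on the map, label length, and so on) are reasonable candidates to carry over to the multi-pin model later.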

Is it true though that people go from one pizza place and immediately go to a 2nd and a 3rd?

If you have a time series model, you can try to connect things together. But for starters, you might get somewhere working with the jumps separately. Like, model the first place people go. Then take that and model the second place people go, and consider adding the first place they were as a covariate.
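To make the "work with the jumps separately" idea concrete, here is a small Python sketch that splits a click log into a first-move table and a second-move table that carries the first pin along as a covariate. The `(user_id, move_number, pin)` layout and the `split_moves` helper are assumptions for illustration; the real clickstream may be shaped differently.

```python
from collections import defaultdict


def split_moves(clicks):
    """Split clicks into first moves and (first pin -> second pin) pairs.

    clicks: iterable of (user_id, move_number, pin) tuples (assumed layout).
    Returns (first_moves, second_moves), where first_moves maps user -> pin and
    second_moves is a list of (user, first_pin, second_pin) tuples.
    """
    by_user = defaultdict(list)
    for user, move, pin in clicks:
        by_user[user].append((move, pin))

    first_moves = {}
    second_moves = []
    for user, seq in by_user.items():
        seq.sort()  # order each user's clicks by move number
        first_moves[user] = seq[0][1]
        if len(seq) > 1:  # some users only click once
            second_moves.append((user, seq[0][1], seq[1][1]))
    return first_moves, second_moves


# Made-up click log: u1 makes three moves, u2 makes one.
clicks = [("u1", 2, "B"), ("u1", 1, "A"), ("u1", 3, "A"), ("u2", 1, "C")]
first_moves, second_moves = split_moves(clicks)
```

The first table feeds the model for move 1; the second feeds the model for move 2, where the first pin (or its distance to each candidate pin) enters as a predictor.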

Good idea, that’s a starting point that I can pursue.

It looks like it’s generally true that people go to a 2nd and a 3rd. However, this isn’t always true. For example, some people jump back to the 1st location after looking at the 2nd. This also brings up the point that the number of places that people visit isn’t the same either. Some people visit a few locations, whereas others visit many.

I haven’t worked with time series before, though it sounds like this is what I need to connect moves together. Any resources other than this bit of the Stan manual that you’d recommend?

Also, I like the idea of trying to model the first step with one model, then model the second step as another model (with first location as a covariate). Seems like a simple starting point I can pursue.

Oh is this visit in the clicked-the-link sense? Or visit in the went-and-ate-pizza sense?

You can also try to estimate how many places someone will check and then which places they are likely to check in separate models. Maybe these phenomena are a bit different.

In the “clicked-the-link” sense. I’m trying to model clickstream data.

In movement ecology, this is exactly the use case for so-called “resource selection functions” or “step selection functions”. I’m a bit rusty on these, and I suggest Peter Turchin’s book “Quantitative analysis of movement” for more.

IIRC, the basic idea is to fit a GLM to model (something proportional to) the probability of taking each “step” as a function of its attributes (both attributes of the destination and attributes of the step itself, such as the distance, the cosine of the turning angle, or interactions between attributes of the current location and the destination), under the constraint that the sum of the probabilities for all possible steps must be equal to 1. (Note that this population of all possible steps changes from step to step, since a step cannot end up back where it started, and also that the covariates associated with a destination, like distance or turning angle, also change.) Notice that because of the normalizing constraint, the family of available link functions is quite large, since inverse links don’t necessarily need to return values on the unit interval.

Edited to clarify: IIRC, the idea isn’t to model the probabilities under the constraint that they sum to 1, but rather to model something proportional to the probability as a function of covariates, and then normalize these values to use as probabilities in the categorical sampling of the steps themselves. The function of covariates can be an arbitrary function that returns strictly nonnegative values and always returns at least one strictly positive value.
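A minimal Python sketch of that normalize-then-sample idea, under assumptions that are mine rather than anything from the step selection literature specifically: pins have planar `(x, y)` coordinates, the only covariate is straight-line distance, and the weight function is `exp(-beta * distance)`, which is one convenient strictly positive choice among many. The names `step_probabilities`, `log_likelihood`, and `beta` are hypothetical.

```python
import math


def step_probabilities(current, pins, beta):
    """Return {pin: probability} for the next step taken from `current`.

    pins: dict of pin name -> (x, y) in arbitrary planar units (assumed).
    beta: distance-decay coefficient (assumed; a real model would estimate it).
    """
    cx, cy = pins[current]
    weights = {}
    for name, (x, y) in pins.items():
        if name == current:
            continue  # a step cannot end where it started
        dist = math.hypot(x - cx, y - cy)
        weights[name] = math.exp(-beta * dist)  # strictly positive weight
    total = sum(weights.values())
    # Normalize so the candidate steps' probabilities sum to 1.
    return {name: w / total for name, w in weights.items()}


def log_likelihood(steps, pins, beta):
    """Sum of log categorical probabilities over observed (from, to) steps."""
    return sum(math.log(step_probabilities(a, pins, beta)[b]) for a, b in steps)


# Made-up coordinates: B is 1 unit from A, C is 3 units from A.
pins = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 3.0)}
probs = step_probabilities("A", pins, beta=1.0)
ll = log_likelihood([("A", "B"), ("B", "A")], pins, beta=1.0)
```

With the exponential weight this is the same as a softmax over `-beta * distance`, so in Stan the analogous building block would probably be `categorical_logit` applied to the linear predictor over each step's candidate set. Note that only the current pin is excluded, so returning to an earlier pin on a later move is still allowed.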


Cool beans. Those seemed like very pizza-hungry individuals otherwise :D

My guess here is to start with the small questions. You can motivate the questions by the big-picture model you’d like to build in the future or by a story you think applies to this data.

I assume also since this is click data there’s probably a bunch of different reasons to be cynical about it (like are the users tagged properly, or do you have id information for more than some small subset of users, or were clicks always meaningful, etc. etc.).

It’s probably worth your time to try to figure out what assumptions you’re making about your data as you use it in your model and you can pass those things along to whoever is interpreting the output so they have them as well. Like maybe some things matter more or less based on what sorts of models you end up building and then that stack of assumptions is useful to someone in the future.