Modeling Relationships Between Clicks and Bookings across Experiments

I’m developing a statistical model to understand how changes in website clicks (which are necessary for later bookings) relate to actual bookings in various experiments. My goal is to create a generative model.

To start, I’ve been simulating data in R to reflect possible outcomes from different experimental actions on clicks and bookings. This is an attempt to mirror real-life variations in these metrics. Here is the initial code for my simulation:

library(tidyverse)

# Simulation Parameters
n <- 1000  # Number of observations
se_click <- 50  # Standard error for clicks
se_book <- 40  # Standard error for bookings
conversion_from_click_to_book <- 0.5  # Average conversion rate from clicks to bookings

# Simulated Data
simulated_data <- tibble(
  real_click = rnorm(n, 0, 100),  # Actual clicks (normally distributed)
  observed_click = rnorm(n, real_click, se_click),  # Observed clicks with added noise
  
  real_book = rnorm(n, conversion_from_click_to_book * real_click, 30),  # Actual bookings (based on clicks)
  observed_book = rnorm(n, real_book, se_book)  # Observed bookings with added noise
)

I think my current simulation lacks a way to show that clicks and bookings are still related through noise, as they should be correlated even without a direct impact. This idea is supported by this paper. I’m looking for advice on how to improve my model to accurately reflect these relationships. Any feedback on my approach and the correctness of my current model would be very helpful. Thanks!
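To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the observation noise on the two metrics is bivariate normal. The noise correlation `rho_noise = 0.6` is a made-up value for illustration, not something I have estimated; everything else reuses the parameters above.

```r
set.seed(1)

n <- 1000
se_click <- 50
se_book <- 40
conversion_from_click_to_book <- 0.5
rho_noise <- 0.6  # hypothetical correlation between the two noise terms

# Covariance matrix of the measurement noise (clicks, bookings)
noise_cov <- matrix(
  c(se_click^2,                     rho_noise * se_click * se_book,
    rho_noise * se_click * se_book, se_book^2),
  nrow = 2
)

# Draw both noise terms jointly so they are correlated
noise <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = noise_cov)

real_click <- rnorm(n, 0, 100)
real_book <- rnorm(n, conversion_from_click_to_book * real_click, 30)

observed_click <- real_click + noise[, 1]
observed_book <- real_book + noise[, 2]

# Sample correlation of the noise terms; should be close to rho_noise
cor(noise[, 1], noise[, 2])
```

With `rho_noise = 0` this reduces to my original simulation, so the question is really whether and how that extra parameter should enter the model.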

P.S. I hope it’s okay that I’m posting this here since I’m not using STAN yet. Let me know if I should (re)move the post. Thanks!

Sorry this has taken so long to respond to.

Posting here is fine. We try not to censor what people post as long as it’s not rude. We just don’t have a lot of time to answer these involved modeling questions.

I don’t understand your simulation because presumably clicks are discrete, not continuous. If you really do want the model you specified with simulated data, that’s trivial to convert to Stan. You’d want to put priors on real_click, se_click, real_book, and se_book.
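To illustrate what I mean by discrete, here is a rough sketch in R, assuming clicks are Poisson counts and each click converts to a booking independently. The mean of 200 clicks is a made-up number for illustration; only the 0.5 conversion rate comes from your simulation.

```r
set.seed(1)

n <- 1000
mean_clicks <- 200      # hypothetical average number of clicks per unit
conversion_rate <- 0.5  # same value as in the original simulation

# Clicks are non-negative integer counts
clicks <- rpois(n, mean_clicks)

# Each click independently becomes a booking with probability conversion_rate,
# so bookings can never exceed clicks
bookings <- rbinom(n, size = clicks, prob = conversion_rate)

head(data.frame(clicks, bookings))
```

This is only one possible count model. It builds in the constraint that bookings require clicks, which the normal simulation doesn’t enforce.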

This is an example of a noisy measurement problem where you get unbiased, normally distributed errors. There’s a chapter in the Stan User’s Guide on measurement error modeling. Presumably you want to include a parameter for the conversion rate, but I don’t know which of the other quantities you know and which you’d need to estimate. The Stan model would look like this:

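data {
  int<lower=0> N;
  vector[N] observed_click;
  vector[N] observed_book;
  real<lower=0> se_obs_click;  // known measurement SD for clicks
  real<lower=0> se_obs_book;   // known measurement SD for bookings
  real<lower=0> se_real_book;  // known SD of real bookings around their mean
}
parameters {
  real<lower=0, upper=1> conversion_rate;
  // One latent value per observation. Left unconstrained because the
  // simulation draws real_click from normal(0, 100), so values can be negative.
  vector[N] real_click;
  vector[N] real_book;
}
model {
  conversion_rate ~ beta(5, 5);  // or some other prior; could be uniform
  real_click ~ normal(0, 100);   // matches rnorm(n, 0, 100) in the simulation
  real_book ~ normal(conversion_rate * real_click, se_real_book);
  observed_click ~ normal(real_click, se_obs_click);
  observed_book ~ normal(real_book, se_obs_book);
}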

Right. If you wanted to estimate the standard errors, you wouldn’t have enough information.
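To see why, here is a small sketch in R using the click part of your simulation. With one noisy observation per latent value, the data only inform the total variance of observed_click, so swapping the two standard deviations (the second setting is hypothetical) gives observations that look the same.

```r
set.seed(1)
n <- 1e5

# Setting A: latent SD 100, measurement SD 50 (as in the original simulation)
obs_a <- rnorm(n, 0, 100) + rnorm(n, 0, 50)

# Setting B (hypothetical): latent SD 50, measurement SD 100
obs_b <- rnorm(n, 0, 50) + rnorm(n, 0, 100)

# Both have the same marginal SD, sqrt(100^2 + 50^2)
c(sd_a = sd(obs_a), sd_b = sd(obs_b), theory = sqrt(100^2 + 50^2))
```

Bookings add a couple more moments, but also more unknowns, so you’d need known standard errors, repeated measurements, or informative priors to pin things down.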

It’s not so much that models are correct or incorrect as much as their being useful or not.

I didn’t see the connection to the paper you linked. It seems to be about estimating treatment effects (i.e., causality).

P.S. It’s “Stan”, not “STAN” because it’s not an acronym.