New type of non-central hypergeometric model not recovering parameter

Continuing the discussion from A different kind of non-central hypergeometric distribution?:

I generate data for a “Sequential Categorical Model” in Python as follows:

import numpy as np
from scipy.special import softmax  # softmax is used below but was never imported
num_skus = 10
true_weights = softmax(np.random.normal(0,1,size=(num_skus,)))
num_selected_events = 100000

available_sku_indicies_this_selection = np.zeros((num_selected_events, num_skus),dtype=int)
number_available_skus_this_selection = []
selected_indicies_array = []

number_times_available = np.zeros((num_skus,),dtype=int)
number_times_selected = np.zeros((num_skus,),dtype=int)

for i in range(num_selected_events):
    # pick number of skus that are available - at least 2
    n = max(2,int(num_skus*np.random.beta(20,10)))
    number_available_skus_this_selection.append(n)
    # pick which skus are available
    skus = np.sort(np.random.choice(num_skus,n,replace=False))
    number_times_available[skus] += 1
    available_sku_indicies_this_selection[i,:n] = [x+1 for x in skus]
    # reweight probabilities
    p = softmax(true_weights[skus])
    s = np.random.choice(skus,p=p)
    number_times_selected[s] += 1
    selected_indicies_array.append(np.where(skus==s)[0][0]+1)

prior_vector = np.log(number_times_selected/np.sum(number_times_selected))

which selects a subset of the 10 items and renormalizes the true_weights to determine the selection probability.

With this data

stan_data = {
    'num_skus': num_skus,
    'num_selected_events': num_selected_events,
    'available_sku_indicies_this_selection': available_sku_indicies_this_selection,
    'number_available_skus_this_selection': number_available_skus_this_selection,
    'selected_indicies': selected_indicies_array,
}

and the following Stan model:

data {
  int<lower=1> num_skus;
  int<lower=1> num_selected_events;
  int<lower=0, upper=num_skus> available_sku_indicies_this_selection[num_selected_events, num_skus]; // padded with zeros
  int<lower=1> number_available_skus_this_selection[num_selected_events];
  int<lower=1,upper=num_skus> selected_indicies[num_selected_events]; // 1-based position of the chosen SKU within that event's available list
}
parameters {
  vector[num_skus] log_weights;
}
model {
  log_weights ~ std_normal();

  for (n in 1:num_selected_events) {
      target += categorical_logit_lpmf(
          selected_indicies[n] | 
          log_weights[available_sku_indicies_this_selection[n,1:number_available_skus_this_selection[n]]]
      );
  }
}

Everything fits fine, but the recovered weights are not the generative true_weights; instead they match
softmax(prior_vector).
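
For reference, here is a minimal Python sketch of the per-event term that the loop in the Stan model adds to target, assuming categorical_logit_lpmf is the log-softmax of the supplied logits evaluated at the selected position. The helper name event_log_lik and the three-SKU example are hypothetical and only for illustration; it uses just the standard library so it runs without NumPy or SciPy.

```python
import math

def event_log_lik(log_weights, available, selected_pos):
    """Log-probability of one selection event.

    available: 0-based SKU indices on offer in this event.
    selected_pos: 1-based position of the chosen SKU within `available`,
    matching the layout of selected_indicies in the Stan data.
    """
    logits = [log_weights[i] for i in available]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[selected_pos - 1] - log_norm

# Hypothetical example: three SKUs with weights 0.5, 0.3, 0.2
log_w = [math.log(w) for w in (0.5, 0.3, 0.2)]
ll = event_log_lik(log_w, available=[0, 2], selected_pos=1)
print(math.exp(ll))  # probability of SKU 0 when only SKUs 0 and 2 are offered
```

Under this reading, the model treats log_weights as log-scale weights, so the implied selection probability is the weight of the chosen SKU divided by the sum of the weights of the available SKUs.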

Can anyone see a discrepancy in the model? I am doing inference on the exact data-generating process but not recovering the true_weights. Is this a problem with the model, or something else?

cc @martinmodrak


I don’t think your Stan model corresponds to this line. At each selection you have a reweighted probability.

I think the log_weights need to be (first exponentiated) softmaxed (then logged) at each n.
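
One way to see the mismatch numerically is the sketch below. It assumes the generator's true_weights are already probabilities (they come out of a softmax) and that the Stan model applies a softmax to log-scale weights; the four example values and the variable names are hypothetical, and this is not necessarily the fix that was applied later in the thread.

```python
import math

def softmax(xs):
    """Plain-Python softmax, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical draws standing in for np.random.normal(0, 1, ...)
true_weights = softmax([0.3, -1.2, 0.8, 0.1])  # already probabilities
skus = [0, 2, 3]                               # SKUs available in one event

# What the generator does: softmax applied to probabilities
p_generator = softmax([true_weights[i] for i in skus])

# Plain renormalization over the available SKUs
subset_total = sum(true_weights[i] for i in skus)
p_renorm = [true_weights[i] / subset_total for i in skus]

# What a categorical-logit likelihood implies if log_weights = log(true_weights)
p_logit = softmax([math.log(true_weights[i]) for i in skus])

print(p_generator)
print(p_renorm)
print(p_logit)
```

In this sketch p_renorm and p_logit agree, while p_generator is noticeably flatter, so a model of this form would be expected to recover weights proportional to exp(true_weights) and not true_weights themselves.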

Is the program that @martinmodrak wrote in R the same to generate the data? If I can get the generative data in R, I can take a look more.

Yes, that should generate the data similarly to my Python code.

Oh, good catch! Yes, you are correct about the differences. Fixed it!

That was it, I really appreciate the note!
