# New type of non-central hypergeometric model not recovering parameter

Continuing the discussion from A different kind of non-central hypergeometric distribution?:

I generate data for a “Sequential Categorical Model” as follows (in Python):

```python
import numpy as np
from scipy.special import softmax

num_skus = 10
true_weights = softmax(np.random.normal(0, 1, size=(num_skus,)))
num_selected_events = 100000

available_sku_indicies_this_selection = np.zeros((num_selected_events, num_skus), dtype=int)
number_available_skus_this_selection = []
selected_indicies_array = []

number_times_available = np.zeros((num_skus,), dtype=int)
number_times_selected = np.zeros((num_skus,), dtype=int)

for i in range(num_selected_events):
    # pick the number of skus that are available - at least 2
    n = max(2, int(num_skus * np.random.beta(20, 10)))
    number_available_skus_this_selection.append(n)
    # pick which skus are available
    skus = np.sort(np.random.choice(num_skus, n, replace=False))
    number_times_available[skus] += 1
    available_sku_indicies_this_selection[i, :n] = [x + 1 for x in skus]  # 1-based for Stan
    # reweight probabilities
    p = softmax(true_weights[skus])
    s = np.random.choice(skus, p=p)
    number_times_selected[s] += 1
    selected_indicies_array.append(np.where(skus == s)[0][0] + 1)

prior_vector = np.log(number_times_selected / np.sum(number_times_selected))
```

which selects a subset of the `10` items and renormalizes the `true_weights` to determine the selection probability.
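One subtlety worth noting: because `true_weights` is already a probability vector (the output of a softmax), applying `softmax` again to a subset of it is not the same as a plain renormalization of that subset. A minimal numpy check (the array values here are arbitrary illustrations, not from the original post):

```python
import numpy as np
from scipy.special import softmax

# an arbitrary probability vector, as produced by softmax of normal draws
true_weights = softmax(np.array([0.5, -1.0, 1.5, 0.0]))
subset = np.array([0, 2])  # indices of the "available" items

# plain renormalization of the subset probabilities
renormalized = true_weights[subset] / true_weights[subset].sum()

# what the generator actually does: a second softmax over the subset
double_softmax = softmax(true_weights[subset])

# both sum to 1, but softmax exponentiates its input, so applying it
# to probabilities is not the same as renormalizing them
print(renormalized)
print(double_softmax)
```

Both vectors are valid probability distributions over the subset, but they differ, which is exactly where the model/generator mismatch can hide.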

With this data

```python
stan_data = {
    'num_skus': num_skus,
    'num_selected_events': num_selected_events,
    'available_sku_indicies_this_selection': available_sku_indicies_this_selection,
    'number_available_skus_this_selection': number_available_skus_this_selection,
    'selected_indicies': selected_indicies_array,
}
```

and the following Stan model:

```stan
data {
  int<lower=1> num_skus;
  int<lower=1> num_selected_events;
  int<lower=0, upper=num_skus> available_sku_indicies_this_selection[num_selected_events, num_skus]; // padded with zeros
  int<lower=1> number_available_skus_this_selection[num_selected_events];
  int<lower=1, upper=num_skus> selected_indicies[num_selected_events];
}
parameters {
  vector[num_skus] log_weights;
}
model {
  log_weights ~ std_normal();

  for (n in 1:num_selected_events) {
    target += categorical_logit_lpmf(
      selected_indicies[n] |
      log_weights[available_sku_indicies_this_selection[n, 1:number_available_skus_this_selection[n]]]
    );
  }
}
```

everything fits fine, but the recovered weights are not the generative `true_weights`; instead they match `softmax(prior_vector)`.

Can anyone see a discrepancy in the model? I am doing inference on the exact data-generating process, yet I am not recovering the `true_weights`. Is this a problem with the model, or something else?


I don’t think your Stan model corresponds to the reweighting line `p = softmax(true_weights[skus])`. At each selection you have a reweighted probability.

I think the `log_weights` need to be softmaxed (i.e., exponentiated and renormalized) at each `n` before being passed in as logits.
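To make the mismatch concrete, here is a hypothetical numpy sketch comparing the two likelihood parameterizations; the names `z`, `w`, and the example values are illustrative, not from the original post:

```python
import numpy as np
from scipy.special import softmax

z = np.array([0.3, -0.7, 1.2, 0.1, -0.2])  # raw weights (log_weights in the Stan model)
w = softmax(z)                              # the generative true_weights
skus = np.array([0, 2, 4])                  # an available subset

# selection probability used by the data generator:
# softmax of the (already softmaxed) weights restricted to the subset
p_generator = softmax(w[skus])

# what the original Stan model computes:
# categorical_logit on the raw weights, i.e. softmax of z over the subset
p_original_model = softmax(z[skus])

# the suggested fix: pass the softmaxed weights as the logits,
# so the model's likelihood matches the generator
p_fixed_model = softmax(w[skus])
```

In the Stan program this would correspond to something like `categorical_logit_lpmf(selected_indicies[n] | softmax(log_weights)[idx])` for the available subset `idx`, though I haven't fitted that variant here.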

Is the program that @martinmodrak wrote in R the same data-generating process? If I can get the generative data in R, I can take a closer look.

Yes, that should generate the data in the same way as my Python code.

Oh, good catch! Yes, you are correct about the difference. Fixed it!

That was it, I really appreciate the note!
