Hi,

I’m fairly new to probabilistic modeling and need feedback on modeling a scenario.

Scenario:

People are assigned a variable number of sites to visit. They are given a set number of days (t) to visit each site. Let ‘d’ be the number of days an individual takes to visit a site. If d <= t, then deviation (dev) = 0; if d > t, dev = (d-t)/t. The average_deviation = average(deviations across the sites the individual visited).

Let p1, p2, p3 ……p1000 be the individuals who are assigned to visit a number of sites. Individuals are typically assigned to visit between 5 and 200 sites.

Let s1, s2, s3, s4….s5000 be the sites. These belong to different types of classes with C1 = {s1,s2,s3…s10}, C2 = {s11,s12}, C3 = {s14,……s50}, C4 = {s51,s52,s53}, and so on. The classes are mutually exclusive. Total number of classes ~ 200. Since a site can be assigned to multiple individuals, the average_deviation at the site level can be calculated which can then be rolled up to average_deviation per class.

The response variable is average deviation for each person.

To instantiate the scenario:

A person p1 is assigned 6 sites s1, s2, s3, s11, s12 and s14. If the deviations were 0, 0, 0.15, 0.2, 0, 2 - the average_deviation of p1 = 0.39. The sites come from 3 classes {C1, C2, C3} which have different average_deviations.
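
To make the definition concrete, here is a minimal Python sketch of the calculation. The function names and the (days taken, days allowed) pairs are hypothetical; the pairs were chosen only so that they reproduce the deviations listed for p1.

```python
def deviation(d, t):
    """Relative lateness: 0 if the visit took at most t days, else (d - t) / t."""
    return 0.0 if d <= t else (d - t) / t

def average_deviation(visits):
    """visits is a list of (days_taken, days_allowed) pairs for one person."""
    return sum(deviation(d, t) for d, t in visits) / len(visits)

# Hypothetical visits for p1, chosen to give deviations 0, 0, 0.15, 0.2, 0, 2.
p1_visits = [(18, 20), (20, 20), (23, 20), (24, 20), (5, 20), (60, 20)]
p1_average = average_deviation(p1_visits)
print(round(p1_average, 2))  # 0.39
```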

It is important to contrast p1 with another person ‘p2’ who visited, say, 50 sites and has the same average_deviation. I think the right way to model the average_deviation per person is to account for the number of sites visited and the average_deviation at the class level. The deviation rate at the site level is not important.

Data observation: Most of the average_deviations of individuals are 0.

My attempt to model:

Average_deviation_1 (Person p1) ~ Exponential(r_1)

r_1 = logit(u)

u = a_1 + num_sites_C1 *b1 + num_sites_C2 *b2 + num_sites_C3 *b3

a_1 ~ Uniform(0,1)

b1 ~ Exp(0.1)

b2 ~ Exp(0.1)

b3 ~ Exp(0.1)

num_sites_C1 = number of sites the person visited that were from class C1, and so on for C2 and C3
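
As a rough sketch of how these counts could be built for p1, assuming sites are referred to by their number and using only the class memberships spelled out above (site_class is a hypothetical helper, not part of any library):

```python
from collections import Counter

def site_class(site_number):
    """Class of a site, using only the memberships listed in the scenario."""
    if 1 <= site_number <= 10:
        return "C1"
    if site_number in (11, 12):
        return "C2"
    if 14 <= site_number <= 50:
        return "C3"
    raise ValueError(f"class not listed for site s{site_number}")

# p1 was assigned s1, s2, s3, s11, s12 and s14.
p1_sites = [1, 2, 3, 11, 12, 14]
num_sites = Counter(site_class(s) for s in p1_sites)
print(dict(num_sites))  # {'C1': 3, 'C2': 2, 'C3': 1}
```

These counts are the num_sites_C1, num_sites_C2 and num_sites_C3 values that enter u for p1.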

I am wondering whether this is the right way to model this scenario or whether I’m making rookie mistakes. Maybe there is a better way to explicitly use the number of sites visited by the individual.

Thank you for the guidance.

I’m not sure what you mean by “deviation” here or why you’re trying to calculate averages.

What is the data you observe and what are you trying to estimate or predict?

The ‘deviation’ is a measure of ‘lateness’ - of how late a person was in visiting a site.

The average is calculated to obtain a measure of performance: what matters is not the fraction of visits (k/n) on which the person was late, but how late the person was on average across all of the visits.

If the site had to be visited in 30 days and the person took 45, the deviation is (45-30)/30 = 0.5. If the site was visited in <= 30 days, deviation = 0, as it conformed to protocol. The average_deviation is the average of all deviations across the sites the person visited.

I observe the number of days a person took to visit a site and the number of sites allocated. Using this I calculate the deviation at each site and the average_deviation (example of person ‘P1’ in earlier post).

A site may be allocated to multiple individuals. An individual may be required to visit a site multiple times.

I also know the membership of sites. They belong to 200 disjoint sets (Classes). Since I see the number of deviations for a site, I’m able to calculate the number of deviations and average_deviations of classes. So for each class, I observe the number of deviations and the average_deviation.
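
A minimal sketch of that roll-up in Python, assuming each observation is stored as a (person, site, deviation) record. The records, the site-to-class mapping and the class_summary helper are all hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict

# Hypothetical site-to-class membership (classes are disjoint).
site_class = {"s1": "C1", "s2": "C1", "s11": "C2", "s12": "C2", "s14": "C3"}

# Hypothetical visit records: (person, site, deviation).
visits = [
    ("p1", "s1", 0.0),
    ("p1", "s11", 0.2),
    ("p1", "s14", 2.0),
    ("p2", "s1", 0.5),
    ("p2", "s12", 0.0),
]

def class_summary(visits, site_class):
    """Return {class: (number of data points, average deviation)}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _person, site, dev in visits:
        cls = site_class[site]
        totals[cls] += dev
        counts[cls] += 1
    return {cls: (counts[cls], totals[cls] / counts[cls]) for cls in counts}

summary = class_summary(visits, site_class)
```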

I want to predict the average_deviation of a person given the number of sites the person visited and the classes from which the sites were sampled.

For example,

Case1 - Average_deviation = 0.5 for person P1 visiting 6 sites sampled from classes C1, C2, C3

Case2 - Average_deviation = 0.49 for person P1 visiting 50 sites sampled from classes C1, C2, C3

Case3 - Average_deviation = 0.51 for person P1 visiting 150 sites sampled from classes C1, C2, C3, C4, C5

The 3 cases have similar average_deviations but different numbers of site visits. The average_deviation in Case2 is more credible than the one in Case1 because it is based on more site visits (50 vs 6).
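
A small simulation can illustrate why the number of visits matters. This is only a sketch under assumptions that are not part of my data: each visit is on time with probability 0.5 and otherwise has an exponentially distributed deviation with mean 1, and visits are independent. The function names are hypothetical.

```python
import random
import statistics

def simulate_average(n_visits, p_late, mean_lateness, rng):
    """Average deviation over n_visits: each visit is on time (0) with
    probability 1 - p_late, otherwise Exponential with the given mean."""
    devs = [
        rng.expovariate(1.0 / mean_lateness) if rng.random() < p_late else 0.0
        for _ in range(n_visits)
    ]
    return sum(devs) / n_visits

def spread_of_average(n_visits, n_reps=2000, seed=0):
    """Standard deviation of the simulated average across n_reps people."""
    rng = random.Random(seed)
    averages = [simulate_average(n_visits, 0.5, 1.0, rng) for _ in range(n_reps)]
    return statistics.stdev(averages)

spread_6 = spread_of_average(6)    # people with 6 visits each
spread_50 = spread_of_average(50)  # people with 50 visits each
print(spread_6, spread_50)
```

Under these assumptions the spread of the average shrinks roughly like 1/sqrt(n), so an average over 50 visits varies much less from person to person than an average over 6 visits.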

Case3 has more visits, but the sites are sampled from more classes.

Getting posteriors in the 3 cases will be helpful to see the observed average_deviation in perspective.

Half of the observed average_deviations across ~2000 people are 0. The average_deviation is ~0 in 40% of classes, so maybe an exponential distribution with 0.2 < lambda < 0.4 is an appropriate prior?

Sorry for being too wordy, but wanted to make sure the scenario is clear.

Let p1, p2, p3 ……p2000 be the individuals who are assigned to visit a number of sites.

Let s1, s2, s3, s4….s5000 be the sites. These belong to different types of classes with C1 = {s1,s2,s3…s10}, C2 = {s11,s12}, C3 = {s14,……s50}, C4 = {s51,s52,s53}, and so on.

Let n be the number of sites each person visited, N = {n1, n2, n3 … n2000}

Let nc be the number of deviations (data points) in each class, NC = {nc1, nc2 … nc200}

Let d be the average_deviation observed for each individual, D = {d1, d2, d3 … d2000}

Let dc be the average_deviation in each class, DC = {dc1, dc2, dc3 … dc200}

If a person P1 made 10 visits, 3 from C1, 3 from C2 and 4 from C3, what is the posterior distribution of that person’s average_deviation?

Going over examples, it looks like this should be modeled as a mixture of exponentials, but I still cannot figure out how the number of visits can be taken into account.

Thanks.