I’d like to submit a case study, Golf Putting. I’ve placed it all at http://www.stat.columbia.edu/~gelman/temp/
Author: Andrew Gelman
Keywords: sports, workflow
R Package Dependencies: rstan
License: BSD (3 clause), CC-BY
I’d like to submit a case study, Golf Putting. I’ve placed it all at http://www.stat.columbia.edu/~gelman/temp/
Author: Andrew Gelman
Keywords: sports, workflow
R Package Dependencies: rstan
License: BSD (3 clause), CC-BY
I’ve updated the case study and put it at the same web-location as above.
I’m not quite sure what to do next. On the Stan Case studies page, it says: “To contribute a case study, please contact us through the Stan Forums…” I didn’t see any place on Stan Forums specifically for contributing case studies, hence these posts on the General page.
Maybe @breckbaldwin can put it on the web page?
I guess it would be good for us to have more specific directions of where to submit case studies, at least if we want to encourage more submissions.
But then we’ll also need someone to manage the case studies page. It’s kind of like being a journal editor. We’ll need a volunteer or group of volunteers to do this!
I do not have the statistical/Stan chops to function as the case studies editor. Any volunteers?
Breck
I’ll take a look. I’m also looking at Imad’s.
Hi, just to let you know, I updated the case study once again; current version is at http://www.stat.columbia.edu/~gelman/temp/
I may be commenting on an older one. I just saw that it was updated when I went to report.
Very cool case study. Definitely think we should add it.
The classical standard deviations here are standard deviations for the maximum likelihood estimator, right?
In the model, can you say y[j] is the number of successes observed in n[j] trials at distance x[j]?
You could mention that Stan vectorizes. I don’t know how much help you want to give the reader.
I don’t like that printing with comment = ‘##’, but that’s presumably the default.
You need something like options(“width” = 120) in order to avoid that wrapping in the figure. When it renders you’ll be able to scroll over. You could also drop the lp__ by including an argument: pars = c(“a”, “b”).
It’s hard to see the difference between light lines and dark lines. I’d plot the draws, then make sure to plot the other one in a different color over it if you can.
After the description, I espected a comparison involving the angle, not just the bare expression 2 * Phi(...) - 1
.
You’re assuming a lot of trig here that might be helpful to work out. But one thing that you should do is discuss that you’re measuring angle from straight in radians, right, so it’ll be 0 +/- error? And it’s the absolute value of that which needs to be below the threshold.
Oops, looks like you’re talking about degrees in the next plot.
The formula for p[j] doesn’t make sense as j never shows up in it. I think it needs x[j] instead of just x.
Is there a way to put the degrees above the line without the sigma = ? They look odd cut in half by the lines.
I don’t like saying Stan sweeps over values consistent with the posterior—can you just say Stan samples from the posterior? Or that it sweeps over values consistent with the prior (improper) and data, i.e., values drawn from the posterior? Or is there a subtle point you’re trying to make here?
Stray a in “σa” after “in preparation to estimating”
You can combine declaration and definition and should wherever possible:
vector[J] p = 2 * Phi(asin((R - r) ./ x) / sigma) - 1;
...
real sigma_degrees = sigma * 180 / pi();
Plus, we recommend spaces between all operators and unfolding as much as possible to remove parens—it really helps with readability in text.
R never ceases to amaze me with its c and list shenanigans.
For efficiency, asin((R - r) ./ x)
should be defined as trasnformed data. You might just make a note of it if you don’t want to clutter the program with a new block and constant.
You don’t need parens in (a / b) / c
—it’s equal to a / b / c
because /
(like *
and +
and -
) is left associative.
Can you come up with a better name than golf.stan and golf1.stan
? How about golf_logit
, and golf_angle
and golf_angle_distance
?
Won’t logistic regression be better at those long puts as it falls off faster?
Starting with this section, it’s getting confusing how you’re mixing computer notation (* for multiplication) and LaTeX (italic variables), as in the definition of u in the second paragraph of “A new model”). Also, some space before that sigma_distance would help—the rendering of LaTeX by MathJax is rough. Fractions might make it more readable aas they’d let you drop a layer of parens.
Same comment about declare and define in the program. And adding spaces between operators. In Stan, we can line up on next line as in most math typesetting.
There’s a dangling “Here’s the new Stan program:” after you included it a paragraph earlier.
Did you coin the folk theorem usage? If so, it’s odd for you to refer to it as “so-called”!
Same comment about width as earlier.
There’s a dangling $ in third bullet point of “new parameter estimates are:”
There’s a dangline paragraph start “Compared to the”
Is that the same new data as before that didn’t match the old data? Did you fit the new model with the old data? If so, it looks like it fits the new data better even being fit to the old data. Not impossible, but if that’s what’s going on, you should say so in order for everyone else not to wonder what’s up.
One of the other problems in your model is no priors on the sigma. Given that they get close to 0, that can be problematic.
Rather than using to_vector
, you could define the data variables as vectors and then just use ./
.
If you are going to apply to_vector
to constants, you should do it once in the transformed data block so as not to have to construct new vectors and do the division every single evaluation.
Rather than “.*”, you can just use backtick and it’ll render as computer code. That’s how you should be handling inline code anyway in markdown.
This general point about having too much data at one point is important in a lot of models, I think. It seems like we should just be able to throw more data at a problem, but if it doesn’t have the right leverage or whatever you call it, it doesn’t help and can cause you to mess up other estimates. I guess it depends on what the goal is here—get the most new predictions right or try to get a sense of accuracy versus distance. Anyway, good grist for the philosophy mill here when it comes to the “we only care about predictions” philosophy.
You need open source licenses if we’re going to publish them and a link to a repository where people can find the source. We can put the latter in example-models if you don’t already have it in your own organization. I recommend these:
Code © 2019, Trustees of Columbia University in New York, licensed under BSD-3.
Text © 2019, Andrew Gelman, licensed under CC-BY-NC 4.0.
We also generally ask people to run from a clean start and report sessionInfo() so that people will be able to see which versions of R packages (like rstan) you used to generate the output. So that’s just a single block with sessionInfo()
called and reported.
No citations? Was that note about distance on a personal correspondence?
P.S. Here’s how I’d write that final model for efficiency, though I probably wouldn’t combine all those priors into a one-liner—I just thought it’d amuse you until we get the
parameters {
real<lower = 0> sigma_, sigma_b, sigma_c;
...
model {
sigma_a, sigma_b, sigma_c ~ foo(theta)` notation into production.
thing into production. Anyway, here it is:
data {
int<lower = 0> J;
vector<lower = 0>[J] n;
vector<lower = 0>[J] x;
vector<lower = 0>[J] y;
real<lower = 0> r;
real<lower = 0> R;
real<lower = 0> overshot;
real<lower = 0> distance_tolerance;
}
transformed data {
vector[J] x_p_overshot = x + overshot;
vector[J] asins = asin((R - r) ./ x);
}
parameters {
real<lower = 0> sigma_theta;
real<lower = 0> sigma_distance;
real<lower = 0> sigma_y;
}
model {
vector[J] p_angle = 2 * Phi(asins / sigma_theta) - 1;
vector[J] denom = x_p_overshot * sigma_distance;
vector[J] p_distance
= Phi((distance_tolerance - overshot) ./ denom)
- Phi(-overshot ./ denom);
vector[J] p = p_angle .* p_distance;
y ./ n ~ normal(p, sqrt(p .* (1 - p) ./ n + sigma_y^2));
{ sigma_theta, sigma_distance, sigma_y } ~ normal(0, 1);
}
generated quantities {
real sigma_degrees = sigma_theta * 180 / pi();
}
Thanks for the detailed peer review! I posted an update here:
http://www.stat.columbia.edu/~gelman/temp/
In addition to addressing your comments, I’ve made a bunch more changes throughout. I’ve interspersed some comments/questions below.
Andrew
September 16
I don’t like that printing with comment = ‘##’, but that’s presumably the default.
I agree. I find that ## thing ugly too. It’s what happens when I use the command print_file() in R. Is there a better way to do this? I looked at your case study of the predator-prey model, and it had Stan code fragments directly quoted. But I had the impression that for Markdown it was better to directly print the model (that way we know that all the Stan code is runnable, there are no synching problems with what’s in the .Rmd file and what’s in the .stan files. Is there a way to print R output in Markdown without the ## things?
You need something like options(“width” = 120) in order to avoid that wrapping in the figure.
I’m assuming this will be fixed in our new default rstan output that we’ve been talking about, as currently I think our default display is too long. I’ll have to confer with Aki, Ben, and Jonah when Aki’s visiting next month.
In the meantime, I’ve changed the display options for now to keep things clean.
You don’t need parens in
(a / b) / c
—it’s equal toa / b / c
because/
(like*
and+
and-
) is left associative.
That I know but I like to be unambiguous in such settings!
Won’t logistic regression be better at those long puts as it falls off faster?
Only by luck! Once we’ve fit the geometry-based model, I’ll just go with that, and I’m not really interested in the logistic model anymore.
SIs that the same new data as before that didn’t match the old data? Did you fit the new model with the old data? If so, it looks like it fits the new data better even being fit to the old data. Not impossible, but if that’s what’s going on, you should say so in order for everyone else not to wonder what’s up.
I don’t understand your question. We have two datasets. First we fit golf_logistic and golf_angle to the small dataset, then we fit golf_angle to the second dataset and see a problem, then we fit golf_angle_2,3 to the second dataset and get a good fit, then we fit golf_angle_4 to the second dataset but we don’t display the fit, we just discuss it. Anyway, I added a section at the end to clarify this point.
This general point about having too much data at one point is important in a lot of models, I think. It seems like we should just be able to throw more data at a problem, but if it doesn’t have the right leverage or whatever you call it, it doesn’t help and can cause you to mess up other estimates. I guess it depends on what the goal is here—get the most new predictions right or try to get a sense of accuracy versus distance. Anyway, good grist for the philosophy mill here when it comes to the “we only care about predictions” philosophy.
You perhaps won’t be surprised to hear that I think the resolution of this particular issue is to think of this a problem of poststratification. It’s fine to have the goal of optimizing average predictive accuracy, but then the question is, what average are you interested in? An average over the data will put most of the weight on those very short putts where the n’s are largest. An equally weighted average from 0 to 80 feet is another story.
You need open source licenses if we’re going to publish them and a link to a repository where people can find the source. We can put the latter in example-models if you don’t already have it in your own organization.
Yes, please do. I will also put this example in the Workflow book but it’s ok to have it in two places, I guess.
I recommend these:
- Code © 2019, Trustees of Columbia University in New York, licensed under BSD-3.
- Text © 2019, Andrew Gelman, licensed under CC-BY-NC 4.0.
I followed the example of what Sophia et al. did here:
https://mc-stan.org/users/documentation/case-studies/dyadic_irt_model.html
P.S. Here’s how I’d write that final model for efficiency, though I probably wouldn’t combine all those priors into a one-liner—I just thought it’d amuse you until we get the
parameters { real<lower = 0> sigma_, sigma_b, sigma_c; ... model { sigma_a, sigma_b, sigma_c ~ foo(theta)` notation into production.
What we really need is combining the declaration with the distribution as here: real<lower=0> sigma_a ~ foo(theta);
thing into production. Anyway, here it is:
data { int<lower = 0> J; vector<lower = 0>[J] n; vector<lower = 0>[J] x; vector<lower = 0>[J] y;
I agree that changing y and n to vectors simplifies the code. But it seems confusing to me to switch the data types just because we’re changing the model. I’d rather be explicit about what we’re doing to the data to make the approximate model.
The right thing to do, I think, would be to fit the binomial model with an error term that’s a smooth function of x. That’s a lot of work, hence I fudged it with independent errors (as noted at the end of the case study). But to add independent errors to the binomial model is a pain in the ass, because (a) we’re now introducing a new latent variable for each parameter, and (b) we’d have to bring back the logit, or something like it, to keep the probabilities bounded between 0 and 1. That’s a lot of complexity in the model and code, and lots of splainin to do, so I thought the better choice for expository purposes would be to just use the normal approximation.
P.S. I added one more thing to the case study. Updated version (dated 19 Sep 2019) is again here: http://www.stat.columbia.edu/~gelman/temp/
I’m no longer in charge of the web site or case studies, so you’ll need to figure out who this is to get it posted.
print_file <- function(file) {
cat(paste(readLines(file), "\n", sep=""), sep="")
}
But then you need to control the knitr options. I’ve been using something like this for the Tufte format output:
knitr::opts_chunk$set(
include = TRUE,
cache = TRUE,
collapse = TRUE,
echo = FALSE,
message = FALSE,
tidy = FALSE,
warning = FALSE,
comment = " ",
dev = "png",
dev.args = list(bg = '#FFFFF8'),
dpi = 300,
fig.align = "center",
fig.width = 7,
fig.asp = 0.618,
fig.show = "hold",
out.width = "90%"
)
You’ll see comment = " "
, whic indents all the R output two lines. I’ve also used `comment = "| ", which creates what look more like a vertical bar like an HTML quote.
Just writing text out, like in the user’s guide, is dangerous because it might not work or match what’s actually run. On the other hand, just printing the whole model out is unwieldy when it gets big or there are a few minor modifications. I’ve sometimes dumped all the code out as an appendix.
You could also control the quantiles printed. But until RStan changes, you need something that works for the tutorial as writtten.
Associativity and precedence are unambiguous! That’s why we can write a + b * c
and not have it be ambiguous—we know *
binds more tightly. Similarly, when we write a * b * c
, we know it’ll be interpreted as (a * b) * c
. This can matter. If a = 10^300
and b = 10^300
and c = 10^-300
, then a * (b * c)
will produce the mathematically correct answer whereas (a * b) * c
will overflow double precision.
Fair enough. You’re not going to get a bunch of empirically minded ML researchers reading this.
This is why I get confused in these papers. I missed the switch in training data from the golf_angle to golf_angle_2,3.
But how do we do that? Do we reweight the training data or separate into a bunch of little models?
That’s insufficient. I should contact Sophia, but I’m tired of playing license police.
License notices need to indicate who owns the copyright. Like me, Columbia owns the copyright to all your code (but they’ve given us a perpetual right to release under open source), whereas you own the copyright to all your text. Also, the licensing is different for text and code. In the past, we’ve wanted the text to be open-sourced, too. Did you explicitly not want to do that? This should also be up to whoever’s in charge of the web site.
:-) That’s a discussion for a different thread.
Hi, thanks. It looks like we’ll be having a case studies director soon so I guess that new person will be able to handle my submission. I don’t think we need all the case studies in anything like a uniform format. All that’s really necessary is that they’re all in markdown, that they all run and that someone has looked each one over so we don’t have anything really horrible showing up.
Also we have someone translating some of the case studies from R to Python so I hope that will be appearing soon.
Some of them are in the form of Jupyter notebooks and we’d probably take one in the form of shell scripts running CmdStan if someone wanted to submit one.
We can post it whenever you’re ready—we don’t need to wait to pay someone else to do it. I just didn’t put it up because (a) I thought you were still revising it because I’m still seeing pull requestions, and (b) I was hoping someone else could do it. But it really only takes me 10 minutes or so to do these if the HTML and licensing is in order.
Put comment = NA
into the R chunk header.
The case study is now up:
https://mc-stan.org/users/documentation/case-studies/golf.html
Thanks, all, for the help.
@andrewgelman: I would very much like to maintain a web policy where we do not distribute anything we do not have a license to distribute. As is, there is no license allowing us to distribute the text of your case study.
I’m not in charge of the web, so I’m not going to rip it out for non-compliance, but I would very much appreciate it if you updated this to include a license for the text. I don’t care what that license is as long as it allows us to distribute your case study.
I believe @breckbaldwin is still nominally in charge of our web site.
@andrewgelman quite popular choice for the text license seems to be
licensed under CC-BY-NC 4.0
(allows derivative work but with BY = need to keep your name, NC = non-commercial)
Thanks. My case study currently says, “All code in this document is licensed via the BSD 3-clause license.”
I can replace it by “licensed under CC-BY-NC 4.0”.
Not replace, but add. You need separate license for code and text. So you would have
“All code in this document is licensed under BSD 3-clause license and all text licensed under CC-BY-NC 4.0”
Thanks. I fixed, and now I know what to do in future case studies.