I have been thinking about an idea, and I would like some feedback on whether it is worth pursuing and whether people think it could be helpful. The idea is to create a tutor-like conversational assistant for using Stan, powered by GPT-4 (if I can get access to the API).
I am sure people have played with ChatGPT by asking it questions about Stan. While I am unsure of the quality and accuracy of the solutions provided, ChatGPT already responds to these questions in one way or another. I assume these responses are based on the information available in its training data, that is, whatever data could be pulled from the open web about Stan.
I also assume the posts on this forum are not included in their training data because it requires login. So, ChatGPT may be missing a good amount of information presented in this forum.
Someone could create a dataset of all the posts and replies in this forum and then use it to fine-tune the language models behind ChatGPT-like tools. I assume the fine-tuned model would provide more accurate and valuable solutions (as taught by several experts in this forum). Such a tool could even be integrated into this platform to automatically generate an initial response to each new post, with the responses moderated by humans. Or, there might be a separate channel or platform for asking questions to StanGPT.
One can easily scrape all available data from any post using a simple query like the following.
For instance, the content of this post can be seen using the following link (with post id 30784):
So, collecting data is not an issue as long as the owners and moderators of this forum are okay with scraping this website’s content with an automated script.
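As a rough illustration of the collection step described above (this is a sketch, not the query from the original post), Discourse forums generally serve a JSON view of a topic when `.json` is appended to the topic URL. The base URL, the helper names, and the reading of 30784 as a topic id are all assumptions made for this example.

```python
import json
import urllib.request
from html.parser import HTMLParser

# Assumption: the forum's public address; adjust if it differs.
BASE_URL = "https://discourse.mc-stan.org"


def topic_json_url(topic_id, base_url=BASE_URL):
    """Build the URL of a topic's JSON view (assumed pattern: /t/<id>.json)."""
    return f"{base_url.rstrip('/')}/t/{topic_id}.json"


def fetch_topic(topic_id, base_url=BASE_URL):
    """Download and decode one topic. Needs network access; not run in the demo below."""
    with urllib.request.urlopen(topic_json_url(topic_id, base_url)) as resp:
        return json.load(resp)


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags from rendered post HTML and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def extract_posts(topic):
    """Return id, author, and plain text for each post in a topic payload.

    Assumes the usual Discourse layout: topic["post_stream"]["posts"],
    where each post carries "id", "username", and rendered HTML in "cooked".
    """
    posts = topic.get("post_stream", {}).get("posts", [])
    return [
        {"id": p["id"], "username": p["username"], "text": html_to_text(p["cooked"])}
        for p in posts
    ]


# Offline demo with a hand-made payload in the assumed shape.
sample = {
    "post_stream": {
        "posts": [
            {"id": 1, "username": "alice", "cooked": "<p>Hello <code>Stan</code></p>"}
        ]
    }
}
print(topic_json_url(30784))
print(extract_posts(sample))
```

With network access, `extract_posts(fetch_topic(30784))` would be the corresponding live call. Long topics may return only the first batch of posts in a single response, so a real scraper would likely need to page through the rest and pause between requests to stay within the forum's rate limits.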
I am curious about what other people think about this idea. Maybe someone has already started doing something like this.
Very cool idea! I’d love it if we had this on the main Stan website.
I doubt anyone has a problem with scraping all the posts. Everything is public.
Is there a way to ensure that the chatbot uses the most up-to-date syntax?
Anyway, even if we say that it can sometimes get things wrong or out of date, there is still utility in it being able to answer nearly all basic questions.
I agree with @spinkney that getting the most up-to-date Stan syntax is important. I was playing around with the free ChatGPT and asked it to write a Stan model. It produced a completely correct model, although it used a few too many local blocks in the model block. I’m not sure why, but I’ve seen this in other old Stan code.
In addition to the forums, StanGPT should be trained on the Stan docs and on as many teaching examples and case studies as can be found on the web.
Have you tried wholesale scraping of Discourse? In the past I looked at meta.discourse.org to try to figure out how to use the API, and it seemed limited.
FYI, I decided to pay for the premium ChatGPT, so if anyone has prompts on which they want to see how GPT-4 fares (free users can only access GPT-3.5), let me know!
Good idea! Maybe posteriordb would be useful for this (@mans_magnusson), though I guess you’d want a group of models that have both perfectly-fine topology and a bunch for each of the classes of pathological topologies you think should be auto-detectable, and I think posteriordb is mostly the former.
The Washington Post discusses the C4 dataset used by Google and Facebook for LLMs (GPT-3 uses 40x more data, and the size of the GPT-4 dataset is unknown). The article also has a search box for checking which webpages are included, and thus which ones may have been used in violation of e.g. CC-BY-NC-ND licenses: https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
As shown by the screenshots, the C4 dataset includes 70k tokens from the Stan Discourse. It is therefore very likely that GPT has seen at least these, and possibly more, from the Stan Discourse.
The C4 dataset also includes 13k tokens from mc-stan.org, but a lot of the material there is under licenses that do not allow derivative works (CC ND) or commercial use (CC NC), or that require attribution (CC BY), so it’s possible that distributing Stan code generated by GPT could violate those licenses. Currently, ChatGPT doesn’t provide a way to check the original source of the code.