Enabling LLMs on Discourse

Should we enable AI on our discourse?

I think yes, especially the Forum Helper persona, which is described as a “General purpose AI bot capable of performing various tasks. Can search your current Discourse instance and use that information to build responses”.

If the majority are OK with this, then we either need an API key or we can self-host one of the LLMs; both options are explained at Discourse AI - AI bot - Site Management - Discourse Meta.

@SGB

3 Likes

I also think yes! What would be a reason to say no?

1 Like

I guess this would lead to the data here being ingested into whatever system. The question to follow up on is whether this is aligned with the forum’s policies, or whether everyone would need to consent before it is turned on.

If it is not aligned with the forum policies, then users should be allowed to opt out somehow (e.g., by removing their posts).

2 Likes

User control of their data is a good point. We can self-host an LLM, and by “we” I mean Flatiron, if they do it. That way we’re in control of the data and can say that it will not be used outside of the LLM server. However, I believe OpenAI and other companies are already crawling Discourse and ingesting this data.

From the link, it looks like the personas would not train on community data.

Is the AI Bot trained on my community data?

The AI Bot is not trained on any data. It uses the retrieval-augmented generation (RAG) technique to get results.
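For anyone unfamiliar with RAG: instead of baking the forum’s data into the model’s weights, relevant posts are retrieved at query time and pasted into the prompt. Here is a toy sketch in Python; the corpus and the bag-of-words scoring are made up for illustration (real systems, presumably including Discourse’s, use learned embeddings instead):

```python
from collections import Counter
import math

# Toy corpus standing in for indexed forum posts (illustrative only).
posts = [
    "Use robots.txt to block crawlers from a site.",
    "RAG retrieves relevant documents and adds them to the prompt.",
    "Stan programs declare parameters in the parameters block.",
]

def bow(text):
    """Bag-of-words term counts, lowercased."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k posts most similar to the query."""
    q = bow(query)
    ranked = sorted(posts, key=lambda p: cosine(q, bow(p)), reverse=True)
    return ranked[:k]

def build_prompt(query):
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How do I declare parameters in Stan?"))
```

The key point for the privacy discussion: the model itself stays frozen; forum text only enters the conversation as retrieved context.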

Even so, I would vote against adding any AI features. I think this community is small and focused enough that the basic search functionality has been more than enough for me to find the information I’m looking for. But it would be helpful to have a better understanding of the personas’ scopes. Maybe someone involved in a Discourse instance that has implemented these could provide some clarification?

For example, what is the scope of the Forum helper?

Forum Helper: General purpose AI bot capable of performing various tasks. Can search your current Discourse instance and use that information to build responses
e.g., “What are the top posts for AI?”

Does “build responses” mean that it can suggest text for a user to post? Or does it refer to building the AI’s responses to user prompts for the purpose of summarizing posts? I am leery of adding tools that generate posts for users, but would be more open to improved summary tools.

Here are a few thoughts from my experience with this at work (this is my first post here, but I’ve been lurking for a while):

I’m on a BI dev team, and we’re using LLMs on forums at work, similar to OP’s suggestion except that it’s all internal. It’s been helpful for automatically answering simpler questions from less technically experienced users (things they could have figured out on their own, but they aren’t programmers and thus aren’t used to searching technical docs, for example). If, as mentioned above, the community is small enough that veteran users can respond to queries without too much effort, though, hosting an LLM may not add much yet.

2 Likes

We are not trying to block bots from training on our forums. Personally, I hope the LLMs can learn how to be better at Stan!

I’m not even sure blocking training crawlers is possible with Discourse. They have a thread about it, but it devolved into the usual amateur philosophy of mind about whether LLMs really understand.

My suggestion: if you really don’t want your text ingested by bots, don’t put it up on the web. We can ask bots not to slurp up our Discourse, but that doesn’t mean they’re going to listen, and we hardly have enough money or time to sue them.

We don’t have enough compute to do this in a useful way. Also, we don’t have the people to engineer such a web site.

2 Likes

Thanks for unlurking and sharing your experience and welcome to the Stan forums!

I’m personally a huge fan of LLMs, and I use them for everything these days. I’m always surprised when people asking for free help with free software want to protect the IP in their questions or in the answers they provide to others. But then I’m the one who insisted on a BSD license for Stan code and a CC license for the documentation, so people could use them for free with minimal strings attached.

1 Like

Thanks for the welcome message!

Likewise, LLMs are my go-to these days for automating tasks such as figuring out tricky LaTeX syntax or writing data-viz scripts (for common libraries such as matplotlib or ggplot). The top LLMs are simply better and faster than I am in those situations and end up saving me both time and frustration.

I haven’t asked an LLM for coding assistance with Stan since earlier this year, but that’s mainly because I’ve found the Stan documentation to be very helpful as is. While this goes a bit beyond the scope of this thread, I imagine an LLM fine-tuned on texts such as BDA3, the Stan docs, and this forum (to mention a few sources) would actually make for a very educational study buddy, and one that I know I’d be interested in working with.

1 Like

I have no idea whether an LLM is needed on this forum, but I do think blocking the use of the forum material for training would be bad, even if it could be done. (One could deny access to the OpenAI, Google, and Anthropic crawlers in robots.txt, if one had control of the file.)
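For reference, blocking the major AI-training crawlers would look something like the following in robots.txt. These user-agent tokens are ones the vendors have published (GPTBot for OpenAI, Google-Extended for Google’s AI training, ClaudeBot for Anthropic, CCBot for Common Crawl), but as noted above, compliance with robots.txt is entirely voluntary on the crawler’s part:

```txt
# Disallow AI-training crawlers (published user-agent tokens; honored voluntarily)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```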

I use LLMs actively, but for Stan and brms only a little, mostly because my work has been elsewhere, and at least brms may be starting to reach the edge of LLMs’ knowledge, in terms of popularity and available training material. OpenAI models seem to be able to write skeletons of Stan programs just fine (including the likelihood part for LOO and such), which is nice. Translating to numpy also seems to be somewhat possible.

Just yesterday I asked several models for brms help. None of them seemed to know that the variance formula has a log link by default, but o1-preview did just fine after I reminded it of the log link. (I sort of know the answer but always forget the details because I don’t use this feature very often.)

In the future, those o1-style reasoning models and their equivalents will be very helpful for coding, so IMO all training material for them would be welcome.

2 Likes

I can generate an API key for the Stan forum using my OpenAI account, and we can use that for a test period (say, one month). The LLM would be theirs, presumably a ChatGPT model.

Self-hosting isn’t that compute-intensive, because you’re only doing inference, not training. I don’t know what kind of specs would be suitable, but I’m sure someone here can give us an idea.

I’m a Discourse admin, so I can enable this, but I want to make sure the community is OK with it. I propose that, if the consensus is a (cautious) yes, we do a trial period of one month and then see whether we want to continue.

3 Likes