OpenCL Async performance

I’ve been reading a little more about OpenCL’s async facilities and they suggest using multiple command queues for the best performance, but I think we’re only using 1 (and have it hardcoded in a global singleton). Does anyone know how much additional performance we’d get from using multiple queues or what that implementation might look like? Tagging usual GPU suspects @rok_cesnovar and @stevebronder :)

references:

Yeah, this is a thing I see pretty frequently. I think that advice is for older devices that do not support async within a queue. That’s why the above and the Stack Overflow questions mentioning this are from 8-10 years ago.

Async works within a single queue on current devices. On older systems where the device doesn’t support it, the way to get around not having actual async was to have separate queues for reading and for writing. We could do this, but then it’s just more queue management overhead.

idk if Rok has more thoughts but that’s my general understanding

That is my understanding as well, yes. This is a workaround for when async is not supported, which was more common back then.