Issues with quitting cmdstan jobs on HPC

Hi

I’ve been running my models on my institute’s HPC, which uses PBS for job queueing. Essentially, I submit a PBS file that runs an R script, which creates the model with cmdstan_model(), runs $sample(), and saves the fit. However, because some of the models were poorly specified and taking too long to run, I killed the jobs using qdel.

I was later notified that some processes associated with my jobs were still on the HPC nodes, using the CPU, despite the jobs being killed.

Can anyone advise on exactly what is happening here? That is, which processes might be left behind, and is there an easy way to make sure all of the processes associated with a job are killed if the job fails or is forced to stop? Is there something I can put in my R script or PBS script to stop this from happening if I have to force-kill jobs again?

Any help would be greatly appreciated.

Cheers


On my local Linux machine, I sometimes have to run killall my_model_name; maybe try that?


This happens every once in a while when qdel can’t reach the nodes. The manual should list a qdel -f or qdel -p option to force-kill or purge the job, but it must be run as an admin.
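
On the question of what could go in the PBS script itself: one option is to have the job script clean up after itself when it receives a signal. As far as I know, qdel on PBS-style schedulers normally sends SIGTERM to the job before following up with SIGKILL, so a bash trap gets a chance to run. Below is a minimal sketch, not a tested fix for this cluster. run_with_cleanup is a hypothetical helper name, and the approach assumes the model executables stay in the same process group as the R process that launched them. Anything that moves itself into another group or session would not be reached.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_cleanup is a hypothetical helper, not part of PBS or cmdstanr.

run_with_cleanup() {
    # With job control on, the background command becomes the leader of its
    # own process group, so the group ID equals the PID stored in $!.
    set -m
    "$@" &
    local pgid=$!
    set +m

    # On SIGTERM or SIGINT, send SIGTERM to every process still in that group.
    # The group ID is expanded now, when the trap is defined.
    trap "kill -TERM -- -$pgid 2>/dev/null" TERM INT

    # wait returns early (status > 128) if a trapped signal arrives.
    wait "$pgid"
    local status=$?

    # If wait was interrupted, the command may still be shutting down, so wait
    # once more. This is harmless if the command has already been reaped.
    wait "$pgid" 2>/dev/null

    trap - TERM INT
    return "$status"
}
```

In a job script this would wrap the R call, for example `run_with_cleanup Rscript fit_model.R`, where fit_model.R stands in for whatever script runs the model. The command is started in the background because bash only runs a trap once the foreground command has finished, which is too late if the scheduler follows up with SIGKILL.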
