Issues with quitting cmdstan jobs on HPC

LockyW · April 16, 2021, 6:09am

Hi

I’ve been running my models on my institute’s HPC (which uses PBS for job queueing). Essentially, I submit a PBS file that runs an R script, which compiles the model with cmdstan_model(), runs $sample(), and saves the fit. However, because some of the models were poorly specified and taking too long to run, I killed the jobs using qdel.

I was later notified that some processes associated with my jobs were still on the HPC nodes, using the CPU, despite the jobs being killed.

Can anyone advise on exactly what is happening here? That is, which processes might be left behind, and what is an easy way to make sure all of the processes associated with a job are killed if the job fails or is forced to stop? Is there something I can put in my R script or PBS script to stop this from happening if I have to force-kill jobs again?

Any help would be greatly appreciated.

Cheers

mike-lawrence · April 16, 2021, 3:31pm

On my local Linux machine, I sometimes have to run killall my_model_name; maybe try that?

yizhang · April 16, 2021, 4:07pm

This happens every once in a while when qdel can’t reach the nodes. There should be a qdel -f or qdel -p option in the manual to force-kill/purge the job, but it must be run as admin.
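
On the question of something to put in the PBS script: one option, as a rough sketch only, is to start R in the background, trap the termination signal in the job script, and signal the whole process tree from there. This assumes the job script runs under bash, that pgrep is available on the compute nodes, and that the scheduler sends SIGTERM before SIGKILL (worth confirming with your admins). The names list_tree, kill_tree, run_with_cleanup and run_model.R are made up for illustration, and I have not tested this under PBS itself.

```shell
#!/bin/bash
# Sketch only. list_tree, kill_tree and run_with_cleanup are hypothetical
# helper names; pgrep (procps) is assumed to exist on the compute node.

# Print the PIDs of all descendants of $1 (deepest first), then $1 itself.
list_tree() {
    local pid="$1" child
    for child in $(pgrep -P "$pid"); do
        list_tree "$child"
    done
    echo "$pid"
}

# Send SIGTERM to $1 and every descendant. The list is collected before
# anything is killed, so children are found while their parent is alive.
kill_tree() {
    local pids
    pids=$(list_tree "$1")
    # Word splitting of $pids is intentional: one argument per PID.
    kill -TERM $pids 2>/dev/null
}

# Run a command in the background and clean up its tree if this shell
# receives SIGTERM or SIGINT (e.g. when the scheduler stops the job).
run_with_cleanup() {
    "$@" &
    child=$!
    trap 'kill_tree "$child"' TERM INT
    wait "$child"
}

# In the PBS script (run_model.R is a placeholder name):
#   run_with_cleanup Rscript run_model.R
```

The command is run in the background and followed by wait because bash only runs a trap once the current foreground command has returned; with R in the foreground, the trap would not fire until R had already exited. This will not help if the job is killed with SIGKILL straight away or if qdel cannot reach the node at all.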
