Re: [HTCondor-users] return code of jupyter notebook jobs



On 3/27/2020 5:25 AM, Beyer, Christoph wrote:
Hi,

We have been running Jupyter notebooks in condor slots in production for a while now, and we need to get some monitoring around this.

One of the bigger obstacles to coming up with something decent is that JupyterHub uses condor_rm to end the notebook once it is no longer needed. This results in a condor_history entry with JobStatus == 3, which is treated as a failed job (which in this case it is not). The other possibility is that the notebook job runs into the time limit and is removed by the periodic_remove expression, which is presumably a bit more flexible to tweak.

I like the idea of having an option for condor_rm to influence the subsequent job state in the history.

I think your idea, whereby condor_rm can influence the subsequent job state in the history, is on target.  Note that
condor_rm takes a "-reason <string>" argument, which lets you set the RemoveReason job attribute at the time of
removal.  This RemoveReason attribute will also appear in the history.  The Python API also supports setting a removal
reason at the time of job removal.
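
As a rough sketch of how this could feed into monitoring: the hub passes an agreed marker string via "-reason", and the monitoring side checks RemoveReason on history records with JobStatus == 3. The marker text and both helper names below are made up for illustration. The history record is modelled as a plain dict rather than a real ClassAd, and the substring match assumes the reason text ends up in RemoveReason as given.

```python
# Hypothetical marker agreed on between the hub and the monitoring scripts.
NORMAL_SHUTDOWN_MARKER = "jupyterhub: notebook stopped normally"

# JobStatus value of a removed job, as seen in condor_history.
REMOVED = 3


def build_condor_rm_command(job_id, reason=NORMAL_SHUTDOWN_MARKER):
    """Return the argument list for a condor_rm call that records a reason.

    The list can be handed to subprocess.run() on a submit host.
    """
    return ["condor_rm", "-reason", reason, job_id]


def classify_history_ad(ad):
    """Classify one history record (dict-like job ad) for monitoring."""
    if ad.get("JobStatus") != REMOVED:
        return "not_removed"
    # Substring match, in case extra text is stored alongside the marker.
    if NORMAL_SHUTDOWN_MARKER in str(ad.get("RemoveReason", "")):
        return "normal_shutdown"
    # Removed for another reason, e.g. the time limit policy or a manual rm.
    return "removed_other"


if __name__ == "__main__":
    print(build_condor_rm_command("1234.0"))
    example = {"JobStatus": 3, "RemoveReason": NORMAL_SHUTDOWN_MARKER}
    print(classify_history_ad(example))
```

With something like this, notebooks ended by the hub can be counted separately from removals caused by the time limit, without changing how JobStatus itself is interpreted.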
Does this help?

best regards,
Todd