
[HTCondor-users] pre-kill warning signals to jobs?



Hi all,

A not fully fermented idea, but: is there a way in Condor for the startd to send a signal to its job on a predefined condition, e.g., as a warning when memory utilization gets close to the requested limit?

Background: we are running our users' Jupyter notebooks as Condor jobs, which is mostly transparent to the users. But notebook users tend to have little sense of how much data they are trying to squeeze into their notebooks. So Condor kills notebook jobs when they hit the memory limit - and to the user the notebook simply appears to have crashed for no obvious reason. While there are memory watcher plugins for notebooks, they can only help so much.

So, as a raw idea: one could predefine a warning threshold like "requested memory - 20%" at which a signal is sent to the job (SIGURG? SIGTRAP??). The signal would still have to be caught in the notebook "somehow" - including by users who would have to handle it themselves in any non-vanilla Jupyter kernels they create (see the sketch below for the catching side).
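Just to illustrate the catching side: a minimal sketch of what a kernel (or user code in a notebook) could do if such a warning signal existed. This assumes SIGURG is the chosen signal and that plain Python signal handling is enough; the handler name and the warning text are made up for the example.

    # Sketch only: catching a hypothetical pre-kill warning signal
    # (assumed to be SIGURG here) inside a Python/Jupyter kernel.
    import signal
    import warnings

    def memory_warning_handler(signum, frame):
        # Surface the warning to the notebook user before the hard kill,
        # e.g. so they can free large objects or save their work.
        warnings.warn("Memory usage is approaching the requested Condor "
                      "limit; the job may be killed soon.")

    # SIGURG is ignored by default, so repurposing it should be fairly safe.
    signal.signal(signal.SIGURG, memory_warning_handler)

Custom kernels would need to register something equivalent themselves, which is exactly the "somehow" part above.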

Maybe something from checkpointing could be re-used for something like that?

Cheers and thanks for any ideas,
  Thomas
