Dear Todd,
Thank you very much for your additional suggestions!
> what version of HTCondor are you using on the execute nodes (condor_version will tell you)?

> On your execute machine does "condor_config_val BASE_CGROUP" return "htcondor" (which it should by default)?

True, for all nodes (submit, execute, central manager).

> What distro of Linux are you using?

CentOS 7.
> Another random thought is to add the following config knob to your config (on your execute nodes, or on all nodes is fine as well):
>
>     USE_PID_NAMESPACES = True

Sadly, even this knob does not solve the issue completely. I.e., condor_rm is able to kill the job. Nevertheless, since Abaqus does not seem to be killed via SIGTERM, a lock-file (called *.lck) is left behind that prohibits rerunning the job without removing it first.
Actually, I note that I would be able to solve this via a pre-command, but that is yet another script that would have to be written and maintained (see the sketch below).
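For illustration, a minimal pre-command sketch (the abq2017 path and argument list are taken from the submit example quoted below; passing the job name as the first argument is my assumption):

    #!/bin/bash
    # pre-command: remove a stale Abaqus lock-file left over from a
    # killed run, then replace this shell with the solver
    JOB="$1"
    rm -f "${JOB}.lck"
    exec /opt/Abaqus/Commands/abq2017 job="${JOB}" input="${JOB}.inp" user=umat.f inter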
Do you have further suggestions?
Thanks in advance.
Felix
> Final random thought: were any of your still-running Abaqus processes stuck in the "D" (disk IO) state when you look at them with /bin/ps? On Linux, processes stuck on I/O cannot be killed, even with "kill -9". I have seen this happen when, for instance, a job is using a stale/stuck NFS mount....

Thanks for this additional hint. But in my case, I am able to kill the job via SIGTERM manually.
On 1/22/2021 3:22 AM, christoph.beyer@xxxxxxx wrote:
Hi Felix,

this is partly a UNIX 'problem': by using exec you replace the previous bash process. exec will never come back but replaces the actual process that called it, hence traps you send to the previous process will not be handled/forwarded either. I don't see the necessity for your 2-line bash script; should not something like:

    executable = /opt/Abaqus/Commands/abq2017
    arguments  = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter

be more straightforward?
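To illustrate Christoph's exec point with a sketch (a hypothetical wrapper, not the original script): any trap installed before the exec can never fire, because bash itself is gone once exec runs.

    #!/bin/bash
    trap 'rm -f *.lck' TERM                   # this trap can never fire, because...
    exec /opt/Abaqus/Commands/abq2017 "$@"    # ...exec replaces bash with abq2017 here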
In addition to Christoph's suggestion above to simply get rid of the wrapper script, what version of HTCondor are you using on the execute nodes (condor_version will tell you), and was HTCondor installed / running as root on the execute nodes? I ask because with HTCondor v8.8 and above, when started as root HTCondor should be using Linux's control groups (cgroups) mechanism by default to make sure all processes involved with a job get killed --- even 'orphaned' processes like Christoph describes.
On your execute machine does "condor_config_val BASE_CGROUP" return "htcondor" (which it should by default) ?
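One way to double-check this on a live job (a sketch; the PID is hypothetical) is to look at the cgroup of one of the job's processes:

    # on the execute node, for a PID belonging to a running job:
    cat /proc/12345/cgroup
    # entries containing "/htcondor/" show the process is tracked
    # under the BASE_CGROUP hierarchy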
What distro of Linux are you using?
Another random thought is to add the following config knob to your config (on your execute nodes, or on all nodes is fine as well):
    USE_PID_NAMESPACES = True
This will tell Linux to put each job in its own pid namespace, meaning the job cannot "see" other processes running on the system with things like /bin/ps ... this can cause problems for some very small percentage of applications, but works fine with 95% of apps out there. In your case, another advantage of pid namespaces is it tells the Linux kernel itself to track and kill all processes associated with a job.
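A quick way to observe this (a sketch; assumes you can get a shell inside the job, e.g. with condor_ssh_to_job, or add a debug command to the job itself):

    # run from inside the job's environment:
    ps ax   # with USE_PID_NAMESPACES = True only the job's own
            # processes are visible, with low PIDs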
Final random thought: were any of your still-running Abaqus processes stuck in the "D" (disk IO) state when you look at them with /bin/ps? On Linux, processes stuck on I/O cannot be killed, even with "kill -9". I have seen this happen when, for instance, a job is using a stale/stuck NFS mount....
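To spot such processes on the execute node, something like this (standard procps options) should work:

    # list processes in uninterruptible ("D") sleep together with
    # the kernel function they are blocked in:
    ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'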
Hope the above helps
Todd