Dear Todd,
Thank you very much for your additional suggestions!
> what version of HTCondor are you using on the execute nodes (condor_version will tell you)?

> On your execute machine does "condor_config_val BASE_CGROUP" return "htcondor" (which it should by default)?

True, for all nodes (submit, execute, central manager).

> What distro of Linux are you using?

CentOS 7.
> Another random thought is to add the following config knob to your config (on your execute nodes, or on all nodes is fine as well):
>
>     USE_PID_NAMESPACES = True

Sadly, even this knob does not solve the issue completely. I.e., condor_rm is able to kill the job. Nevertheless, since Abaqus does not seem to be killed via SIGTERM, a lock-file (called *.lck) is left behind that prohibits rerunning the job without removing it first.
Actually, I note that I would be able to solve this via a pre-command, but that is yet another script that would have to be written and maintained (see the sketch below).
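For illustration, a minimal pre-command sketch (the abq2017 path and argument list are taken from the submit example quoted below; passing the job name as the first argument is my assumption):

    #!/bin/bash
    # pre-command: remove a stale Abaqus lock-file left over from a
    # killed run, then replace this shell with the solver
    JOB="$1"
    rm -f "${JOB}.lck"
    exec /opt/Abaqus/Commands/abq2017 job="${JOB}" input="${JOB}.inp" user=umat.f inter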
Do you have further suggestions?
Thanks in advance.
Felix
> Final random thought: were any of your still-running Abaqus processes stuck in the "D" (disk IO) state when you look at them with /bin/ps? On Linux, processes stuck on I/O cannot be killed, even with "kill -9". I have seen this happen when, for instance, a job is using a stale/stuck NFS mount....

Thanks for this additional hint. But in my case, I am able to kill the job via SIGTERM manually.
On 1/22/2021 3:22 AM, christoph.beyer@xxxxxxx wrote:
Hi Felix,

this is partly a UNIX 'problem': by using exec you replace the previous bash process. exec will never come back but replaces the actual process that called it, hence traps you send to the previous process will not be handled/forwarded either. I don't see the necessity for your 2-line bash script; should not something like:

    executable = /opt/Abaqus/Commands/abq2017
    arguments  = job=sim01_NI1100 input=sim01_NI1100.inp user=umat.f inter

be more straightforward?
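To illustrate Christoph's exec point with a sketch (a hypothetical wrapper, not the original script): any trap installed before the exec can never fire, because bash itself is gone once exec runs.

    #!/bin/bash
    trap 'rm -f *.lck' TERM                   # this trap can never fire, because...
    exec /opt/Abaqus/Commands/abq2017 "$@"    # ...exec replaces bash with abq2017 here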
In addition to Christoph's suggestion above to simply get rid of the wrapper script, what version of HTCondor are you using on the execute nodes (condor_version will tell you), and was HTCondor installed / running as root on the execute nodes? I ask because with HTCondor v8.8 and above, when started as root HTCondor should be using Linux's control groups (cgroups) mechanism by default to make sure all processes involved with a job get killed --- even 'orphaned' processes like Christoph describes.
On your execute machine does "condor_config_val BASE_CGROUP" return "htcondor" (which it should by default) ?
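One way to double-check this on a live job (a sketch; the PID is hypothetical) is to look at the cgroup of one of the job's processes:

    # on the execute node, for a PID belonging to a running job:
    cat /proc/12345/cgroup
    # entries containing "/htcondor/" show the process is tracked
    # under the BASE_CGROUP hierarchy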
What distro of Linux are you using?
Another random thought is to add the following config knob to your config (on your execute nodes, or on all nodes is fine as well):
    USE_PID_NAMESPACES = True
This will tell Linux to put each job in its own pid namespace, meaning the job cannot "see" other processes running on the system with things like /bin/ps ... this can cause problems for some very small percentage of applications, but works fine with 95% of apps out there. In your case, another advantage of pid namespaces is it tells the Linux kernel itself to track and kill all processes associated with a job.
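A quick way to observe this (a sketch; assumes you can get a shell inside the job, e.g. with condor_ssh_to_job, or add a debug command to the job itself):

    # run from inside the job's environment:
    ps ax   # with USE_PID_NAMESPACES = True only the job's own
            # processes are visible, with low PIDs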
Final random thought: were any of your still-running Abaqus processes stuck in the "D" (disk IO) state when you look at them with /bin/ps? On Linux, processes stuck on I/O cannot be killed, even with "kill -9". I have seen this happen when, for instance, a job is using a stale/stuck NFS mount....
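To spot such processes on the execute node, something like this (standard procps options) should work:

    # list processes in uninterruptible ("D") sleep together with
    # the kernel function they are blocked in:
    ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'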
Hope the above helps
Todd