
Re: [HTCondor-users] Strange Condor Behavior - Possible Bug

Date: Mon, 28 Sep 2015 16:13:17 -0400
From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Strange Condor Behavior - Possible Bug

The first thing I'd suggest is to look at the starter log files for one of the problem jobs, which would be on btbal3600 or 3610 based on your logs below. Looks like maybe you're using partitionable slots? It'd be StarterLog.slot1 if not, or StarterLog.slot1_* if so.
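If it helps to gather those files in one go, here is a minimal Python sketch. It assumes you pass in the execute node's log directory (whatever `condor_config_val LOG` reports on that machine); the function name `starter_logs` and the default `slot1` are illustrative, not part of HTCondor.

```python
from pathlib import Path


def starter_logs(log_dir, slot="slot1"):
    """Return StarterLog files for one slot, static or partitionable."""
    base = Path(log_dir)
    # Static slot: a single file named StarterLog.slot1
    exact = [p for p in [base / f"StarterLog.{slot}"] if p.is_file()]
    # Partitionable slot: one file per dynamic slot, e.g. StarterLog.slot1_1
    dynamic = sorted(base.glob(f"StarterLog.{slot}_*"))
    return exact + dynamic
```

The wildcard also picks up rotated copies such as a `.old` file for a dynamic slot, which can be useful if the job ended a while ago.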

That may give you a bit more insight into why the termination took place. Sounds like there's precious little in the stdout and stderr from what you wrote.

I've seen this sort of thing when a job balloons its memory and gets nailed by the kernel's out-of-memory killer. Your "memory-used" and "memory requested" figures in the log file show 1709, so that may be unlikely here. Still, if you see "oom" in the /var/log/syslog file, that indicates the killer was triggered, and you can glean the details from the syslog. How much memory do the exec nodes have?
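A rough way to do that check from Python, as a sketch only: the pattern below looks for the two phrases the Linux kernel typically logs when the killer fires ("invoked oom-killer" and "Out of memory"). The function name `find_oom_lines` is made up for illustration, and the log path is an assumption (some distributions write kernel messages to /var/log/messages instead).

```python
import re

# Matching "oom-killer" rather than bare "oom" avoids hits on words like "room".
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Path is an assumption; reading it usually requires root or adm group access.
    with open("/var/log/syslog", errors="replace") as fh:
        for hit in find_oom_lines(fh):
            print(hit)
```

The matching lines name the process that was killed, so you can compare its PID and timestamp against the starter log for the job.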

You can implement your 12-hour time limit internally to the job using a periodic_hold or periodic_remove expression:

periodic_hold = ( time() - JobCurrentStartDate > 12*$(HOUR) )
periodic_hold_reason = "Job exceeded 12-hour runtime limit."
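To make the arithmetic in that expression concrete, here is a small Python sketch of the same comparison. It assumes $(HOUR) expands to 3600 seconds and that both timestamps are Unix epoch seconds, as time() and JobCurrentStartDate are; the function name `exceeds_runtime_limit` is only for illustration.

```python
HOUR = 3600  # assumption: $(HOUR) expands to 3600 seconds


def exceeds_runtime_limit(now, job_current_start_date, limit_hours=12):
    """Mirror of: time() - JobCurrentStartDate > 12*$(HOUR)."""
    # Strictly greater than, so a job at exactly 12 hours is not yet held.
    return (now - job_current_start_date) > limit_hours * HOUR
```

With a 12-hour limit the threshold is 43200 seconds, so the hold fires on the first evaluation after the job has been running longer than that.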


	Michael V. Pelletier IT Program Execution Principal Engineer 978.858.9681 (5-9681) NOTE NEW NUMBER 339.293.9149 cell 339.645.8614 fax michael.v.pelletier@xxxxxxxxxxxx

