Subject: Re: [HTCondor-users] Strange Condor Behavior - Possible Bug
The first thing I'd suggest is to look
at the starter log files for one of the problem jobs, which would be on
btbal3600 or 3610 based on your logs below. Looks like maybe you're using
partitionable slots? It'd be StarterLog.slot1 if not, or StarterLog.slot1_*
if so.
That may give you a bit more insight
into why the termination took place. Sounds like there's precious little
in the stdout and stderr from what you wrote.
I've seen this sort of thing when a job
balloons its memory and gets nailed by the kernel's out-of-memory killer.
Your "memory-used" and "memory requested" figures
in the log file show 1709, so that may be unlikely, but if you see "oom"
in /var/log/syslog, that indicates the killer was triggered,
and you can glean the details from there. How much memory do the exec
nodes have?
You can implement your 12-hour time
limit within the job itself using a periodic_hold or periodic_remove expression.
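Something along these lines in the submit description might do it. Treat it
as a rough sketch rather than a tested recipe: the 12-hour figure comes from
your description, the hold reason text is just a placeholder, and it assumes
the limit should apply to each run of the job, since EnteredCurrentStatus
resets whenever the job changes state.

    # Hold any job that has been running (JobStatus == 2) for more than 12 hours
    periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > (12 * 3600))
    # Placeholder text, shown with the held job
    periodic_hold_reason = "Job exceeded the 12-hour run time limit"

    # Or, to remove it from the queue outright instead of holding it:
    # periodic_remove = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > (12 * 3600))

Holding rather than removing leaves the job in the queue, so you can look it
over with condor_q -hold before deciding what to do with it.
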
Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax
michael.v.pelletier@xxxxxxxxxxxx