[Condor-users] Jobs stuck in running state after completion
- Date: Wed, 02 Jun 2010 13:30:40 +0200
- From: "Joan J. Piles" <jpiles@xxxxxxxxx>
- Subject: [Condor-users] Jobs stuck in running state after completion
Hello,
We are having a very strange problem with our condor installation.
We have a pool of ~100 nodes running condor 7.4.1, and both a submitter
node and a central manager node with the same version.
We have found that some jobs complete without error, but for some reason their status is not updated and they still show as running in condor_q. The slot shows as Claimed/Idle, and the user is still charged for this time. The only unusual thing about such a job is that its ClassAd contains:
TerminationPending = True
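For reference, the affected jobs can be listed with a constraint along these lines (just a sketch using the standard condor_q -constraint syntax; JobStatus == 2 is the "running" state):
condor_q -constraint 'JobStatus == 2 && TerminationPending =?= True'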
In fact, on the processing node there are no processes left from that user (we have a shared UID space), and in the StarterLog we can see:
06/02 12:54:19 Create_Process succeeded, pid=11864
06/02 12:54:59 Process exited, pid=11864, status=0
06/02 12:55:02 Got SIGQUIT. Performing fast shutdown.
06/02 12:55:02 ShutdownFast all jobs.
06/02 12:55:02 **** condor_starter (condor_STARTER) pid 11862 EXITING WITH STATUS 0
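Regarding the Claimed/Idle slots mentioned above, they can be spotted with something like this (a sketch using the standard State and Activity machine attributes):
condor_status -constraint 'State == "Claimed" && Activity == "Idle"'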
Furthermore, on the submitter node the ShadowLog shows:
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(NumJobStarts = 1)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(DiskUsage = 1978)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1275476102)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteUserCpu = 11.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ImageSize = 5973628)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitCode = 0)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(TerminationPending = TRUE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(CommittedTime = 52)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesSent = 457629.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesRecvd = 1152098.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Job 38963.2 terminated: exited with status 0
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(1) - @1275476113.058234 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now WRITE
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(2) - @1275476113.060709 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now UNLOCKED
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Forking Mailer process...
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): **** condor_shadow (condor_SHADOW) pid 26318 EXITING WITH STATUS 100
So, as far as the shadow is concerned, the job seems to end OK (and exit status 100 from condor_shadow should mean a normal job exit, if I understand it correctly).
However, condor_q still shows the job as running.
The only way the job actually finishes is when, in some kind of cleanup, we find lines like this one in the CollectorLog:
06/02 11:49:43 **** Removing stale ad: "< slot1.8@xxxxxxx , 172.16.3.36 >"
However, this doesn't always happen, and it's mostly random.
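In case it is relevant, I understand that this stale-ad cleanup in the collector is governed by the CLASSAD_LIFETIME setting (900 seconds by default, if I am not mistaken); on the central manager its current value can be checked with:
condor_config_val CLASSAD_LIFETIME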
Some jobs can't even be removed via "condor_rm" and stay forever in the X state, forcing us to use "condor_rm -forcex" to really remove them.
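Concretely, that means running something like this (using the job id from the logs above just as an example):
condor_rm 38963.2
condor_rm -forcex 38963.2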
The strange thing is that the log file shows the jobs as completed, and
the result files are returned correctly.
Since it looks like some kind of communication problem, we have tried disabling file locking for the event log and ignoring NFS lock errors (IGNORE_NFS_LOCK_ERRORS = True and EVENT_LOG_LOCKING = False), and switching to TCP for the communications to the collector daemon, all without success, so we are mostly out of ideas.
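For reference, the relevant configuration now looks roughly like this (assuming UPDATE_COLLECTOR_WITH_TCP is indeed the right knob for the TCP switch):
IGNORE_NFS_LOCK_ERRORS = True
EVENT_LOG_LOCKING = False
# assumed knob for sending collector updates over TCP
UPDATE_COLLECTOR_WITH_TCP = True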
Has anybody experienced similar problems, or does anyone know what could be the cause?
Thanks,
Joan