[Condor-users] Jobs getting held for no obvious reason
- Date: Thu, 20 Nov 2008 10:01:20 -0000
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: [Condor-users] Jobs getting held for no obvious reason
Hi,
I've noticed that a lot of jobs on our pool are being held for no obvious
reason. It seems to happen to the longer-running jobs (> 1 day),
but there's no apparent pattern. The hold reason is typically given as:
HoldReason = "Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied"
and the job log file shows:
Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied
58428532 - Run Bytes Sent By Job
178835 - Run Bytes Received By Job
...
012 (9648.000.000) 11/07 17:08:37 Job was held.
Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied
Code 12 Subcode 13
But the directory in question is there and the permissions are OK. I'm running
the central manager/submit host on a Sun V440 with Solaris 10. Execute hosts
are all Win XP SP2 and everything is Condor 7.0.2.
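For anyone wanting to repeat the directory check, here is a minimal Python sketch of the kind of test I mean. The helper name check_writable is made up for illustration, and it only reports what the account running the script can do, so it would need to be run as whichever user the shadow writes files as (an assumption about the setup, not something taken from the logs).

```python
import os


def check_writable(path):
    """Report whether path exists, is a directory, and is writable.

    os.access() tests the real uid/gid of the calling process, so run this
    as the account that has to write into the spool directory.
    """
    return {
        "exists": os.path.exists(path),
        "is_dir": os.path.isdir(path),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Spool directory taken from the hold reason above.
    print(check_writable("/opt1/condor/mws_pool_spool"))
```

A passing check at one moment does not rule out the per-job temporary directory having different permissions at the time of the failed transfer.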
As a workaround I placed this in the config file:
#ICS workaround for "failed to write to file ... permission denied" problem
#ICS release the job up to 10 times if on hold for over 10 minutes
SYSTEM_PERIODIC_RELEASE = (JobRunCount < 10 && CurrentTime - EnteredCurrentStatus > 600) && \
    (HoldReasonCode == 12 || HoldReasonSubCode == 13)
but as far as I can see the jobs aren't getting released automatically.
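To make the intent of the expression explicit, here is a rough Python sketch of the same test applied to a plain dictionary of job attributes. The function name should_release and the example values are hypothetical, and it does not model how ClassAds treat undefined attributes, so it only illustrates the logic and is not a way to reproduce what the schedd does.

```python
def should_release(ad, now):
    """Mirror of the SYSTEM_PERIODIC_RELEASE expression for a plain dict.

    ad  -- dict holding the four job attributes used by the expression
    now -- current time in seconds, standing in for CurrentTime
    """
    under_retry_limit = ad["JobRunCount"] < 10
    held_long_enough = now - ad["EnteredCurrentStatus"] > 600
    # Note the "or": either the code or the subcode matching is enough.
    matching_hold = ad["HoldReasonCode"] == 12 or ad["HoldReasonSubCode"] == 13
    return under_retry_limit and held_long_enough and matching_hold


# Hypothetical attribute values for a job held with Code 12 Subcode 13.
example_ad = {
    "JobRunCount": 1,
    "EnteredCurrentStatus": 1000,
    "HoldReasonCode": 12,
    "HoldReasonSubCode": 13,
}

if __name__ == "__main__":
    print(should_release(example_ad, now=1000 + 601))
```

With these values the job qualifies once it has been held for more than 600 seconds, which is what I expected the real expression to do.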
Any help would be most appreciated - this has me baffled.
-ian.
--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University Of Liverpool,
Computing Services Department,